NVMe triggers IOMMU faults on TX2
Hi, we are operating a cluster of 4 TX2s; each node has the same composition: a TX2 on the NVIDIA ITX devkit board + our PLX switch (8624) + an NVMe SSD (Samsung 960 Pro). There is also our FPGA-based camera attached to the PLX switch, which captures video through V4L2 drivers, and a user-space application writes one file per frame. The total data rate is 680 MB/s.

When we start recording frames, at random intervals the NVMe driver gets kicked out in response to an IOMMU protection violation. This happens only when V4L2 capture and NVMe writes run concurrently. If the NVMe is only read, no fault occurs. I also could not trigger it with writes alone while the camera was stopped - likely the IOMMU is not as busy then.

When running on an old TX1 node (with the 3.10 kernel), everything works without issues for hours of recording (12 minutes at a time, then erase and repeat).

This appears on both R27.1 and R28.1 with the latest L4T kernel.

The faults from various runs at various nodes are all similar:

[  362.090058] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x83829f00, fsynr=0x60003, cb=22, sid=17(0x11 - AFI), pgd=270064003, pud=270064003, pmd=1cf0c9003, pte=0
[ 288.945693] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x842a1e80, fsynr=0x20003, cb=22, sid=17(0x11 - AFI), pgd=270061003, pud=270061003, pmd=2170ae003, pte=0
[ 252.095736] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x839fbe00, fsynr=0x60003, cb=22, sid=17(0x11 - AFI), pgd=2754c5003, pud=2754c5003, pmd=1cc4d8003, pte=0
[ 247.646129] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x8185ff80, fsynr=0x1a0003, cb=22, sid=17(0x11 - AFI), pgd=270068003, pud=270068003, pmd=21bc7d003, pte=0
[ 494.143285] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x8398fe80, fsynr=0x240003, cb=22, sid=17(0x11 - AFI), pgd=27005f003, pud=27005f003, pmd=1cd530003, pte=0
[ 533.487312] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x892e3880, fsynr=0x240003, cb=22, sid=17(0x11 - AFI), pgd=270060003, pud=270060003, pmd=1d3b23003, pte=0
[ 143.097750] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x840d5e80, fsynr=0x240003, cb=22, sid=17(0x11 - AFI), pgd=270077003, pud=270077003, pmd=21c185003, pte=0
[ 262.925572] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x83f3be00, fsynr=0x80003, cb=21, sid=17(0x11 - AFI), pgd=26f87c003, pud=26f87c003, pmd=21ab5e003, pte=0
[ 279.781618] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x83f3be80, fsynr=0x80003, cb=21, sid=17(0x11 - AFI), pgd=26f87c003, pud=26f87c003, pmd=21ab5e003, pte=0
[ 4710.237528] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x83473f80, fsynr=0x240003, cb=22, sid=17(0x11 - AFI), pgd=270068003, pud=270068003, pmd=20ec53003, pte=0



This is how the NVMe gets lost / kicked out - afterwards it is no longer listed in lspci either. Swapping the CPU module in this system for a TX1 makes it work perfectly.

[  142.761669] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 142.827824] nvme 0000:03:00.0: Cancelling I/O 806 QID 4
[ 143.097750] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x840d5e80, fsynr=0x240003, cb=22,
[ 143.404216] irq 55: nobody cared (try booting with the "irqpoll" option)
[ 143.477523] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W O 4.4.38+ #5
[ 143.563966] Hardware name: quill (DT)
[ 143.607710] Call trace:
[ 143.636880] [<ffffffc000089a10>] dump_backtrace+0x0/0xe8
[ 143.700411] [<ffffffc000089b0c>] show_stack+0x14/0x20
[ 143.760825] [<ffffffc0003bfbd0>] dump_stack+0xa0/0xc8
[ 143.821235] [<ffffffc0000f6a38>] __report_bad_irq+0x38/0xe8
[ 143.887900] [<ffffffc0000f6dc8>] note_interrupt+0x210/0x2f0
[ 143.954573] [<ffffffc0000f4204>] handle_irq_event_percpu+0x224/0x2a0
[ 144.030623] [<ffffffc0000f42c8>] handle_irq_event+0x48/0x78
[ 144.097249] [<ffffffc0000f7860>] handle_fasteoi_irq+0xb8/0x1b0
[ 144.167033] [<ffffffc0000f367c>] generic_handle_irq+0x24/0x38
[ 144.235777] [<ffffffc0000f397c>] __handle_domain_irq+0x5c/0xb8
[ 144.305564] [<ffffffc0000815b8>] gic_handle_irq+0x68/0xf0
[ 144.370143] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 144.427429] [<ffffffc0000a70d0>] irq_exit+0x88/0xe0
[ 144.485752] [<ffffffc0000f3980>] __handle_domain_irq+0x60/0xb8
[ 144.555541] [<ffffffc0000815b8>] gic_handle_irq+0x68/0xf0
[ 144.620119] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 144.677405] [<ffffffc0000c7fc0>] finish_task_switch+0xa8/0x1f8
[ 144.747207] [<ffffffc000c13b4c>] __schedule+0x274/0x7a0
[ 144.749672] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 144.749720] nvme 0000:03:00.0: Cancelling I/O 1 QID 0
[ 144.943019] [<ffffffc000c140bc>] schedule+0x44/0xb8
[ 145.001350] [<ffffffc000c14588>] schedule_preempt_disabled+0x20/0x40
[ 145.077369] [<ffffffc0000e340c>] cpu_startup_entry+0xfc/0x340
[ 145.146111] [<ffffffc000c12150>] rest_init+0x88/0x98
[ 145.205482] [<ffffffc001167978>] start_kernel+0x39c/0x3b0
[ 145.270062] [<0000000080c19000>] 0x80c19000
[ 145.320051] handlers:
[ 145.347133] [<ffffffc0009679d8>] tegra_mcerr_hard_irq threaded [<ffffffc000967a20>] tegra_mcerr_threa
[ 145.465523] Disabling IRQ #55
[ 145.494185] (255) csr_afir: EMEM address decode error
[ 145.554470] status = 0x2032700e; addr = 0x3ffffffc0
[ 145.554595] nvme 0000:03:00.0: Device failed to resume
[ 145.554673] blk_update_request: I/O error, dev nvme0n1, sector 452417536
[ 145.554763] Aborting journal on device nvme0n1-8.
[ 145.554768] Buffer I/O error on dev nvme0n1, logical block 62423040, lost sync page write
[ 145.554770] JBD2: Error -5 detected when updating journal superblock for nvme0n1-8.
[ 145.554796] Buffer I/O error on dev nvme0n1, logical block 0, lost sync page write
[ 145.554800] EXT4-fs error (device nvme0n1): ext4_journal_check_start:56: Detected aborted journal
[ 145.554803] EXT4-fs (nvme0n1): Remounting filesystem read-only
[ 145.554804] EXT4-fs (nvme0n1): previous I/O error to superblock detected
[ 145.554807] Buffer I/O error on dev nvme0n1, logical block 0, lost sync page write
[ 145.554982] Buffer I/O error on dev nvme0n1, logical block 1, lost async page write
[ 145.554988] Buffer I/O error on dev nvme0n1, logical block 1041, lost async page write
[ 145.554992] Buffer I/O error on dev nvme0n1, logical block 1057, lost async page write
[ 145.554996] Buffer I/O error on dev nvme0n1, logical block 9249, lost async page write
[ 146.815802] secure: yes, access-type: read
[ 146.869748] Trying to vfree() nonexistent vm area (ffffff8000378000)
[ 146.942890] ------------[ cut here ]------------
[ 146.998063] WARNING: at ffffffc0001b0560 [verbose debug info unavailable]
[ 147.079300] Modules linked in: bridge stp llc imx183(O) sdma(O) snd_soc_spdif_tx snd_soc_core snd_com

[ 147.279819] CPU: 4 PID: 2171 Comm: nvme0 Tainted: G W O 4.4.38
[ 147.359019] Hardware name: quill (DT)
[ 147.402762] task: ffffffc0610d3e80 ti: ffffffc1f0138000 task.ti: ffffffc1
[ 147.492344] PC is at __vunmap+0xe0/0xe8
[ 147.538185] LR is at __vunmap+0xe0/0xe8
[ 147.584018] pc : [<ffffffc0001b0560>] lr : [<ffffffc0001b0560>] pstate: 6
[ 147.672532] sp : ffffffc1f013bca0
[ 147.712107] x29: ffffffc1f013bca0 x28: 0000000000000000
[ 147.782586] x27: 0000000000000000 x26: 0000000000000000
[ 147.846156] x25: 0000000000000000 x24: 0000000000000000
[ 147.909730] x23: ffffffc000692a10 x22: ffffffc1f0059400
[ 147.973305] x21: 0000000000000000 x20: 0000000000000000
[ 148.036879] x19: ffffff8000378000 x18: 0000000000000000
[ 148.100454] x17: 0000007f803d91d8 x16: ffffffc00011b5d8
[ 148.164028] x15: 0000000000000010 x14: 0a29303030383733
[ 148.227598] x13: 3030303866666666 x12: 6666282061657261
[ 148.291173] x11: 206d7620746e6574 x10: 736978656e6f6e20
[ 148.354747] x9 : 2928656572667620 x8 : 0000000000000552
[ 148.418322] x7 : 0000000000000040 x6 : 0000000000000004
[ 148.481897] x5 : ffffffc0610d3ee0 x4 : 0000000000000000
[ 148.545472] x3 : 0000000000000002 x2 : 0000000000000000
[ 148.609046] x1 : 0000000000000000 x0 : 0000000000000038

[ 148.689628] ---[ end trace 21d29f72bdecabc8 ]---
[ 148.738624] Call trace:
[ 148.767787] [<ffffffc0001b0560>] __vunmap+0xe0/0xe8
[ 148.826116] [<ffffffc0001b0690>] vunmap+0x28/0x38
[ 148.882362] [<ffffffc00009c6d4>] __iounmap+0x34/0x40
[ 148.941732] [<ffffffc000692a84>] nvme_dev_unmap.isra.27+0x1c/0x38
[ 149.014640] [<ffffffc0006946d8>] nvme_remove+0xd8/0x110
[ 149.077137] [<ffffffc000415abc>] pci_device_remove+0x3c/0x108
[ 149.145898] [<ffffffc000623580>] __device_release_driver+0x80/0xf0
[ 149.219847] [<ffffffc000623614>] device_release_driver+0x24/0x38
[ 149.291698] [<ffffffc00040ea08>] pci_stop_bus_device+0x98/0xa8
[ 149.361481] [<ffffffc00040eb44>] pci_stop_and_remove_bus_device_locked+0x
[ 149.450019] [<ffffffc000692a34>] nvme_remove_dead_ctrl+0x24/0x58
[ 149.521886] [<ffffffc0000c085c>] kthread+0xdc/0xf0
[ 149.579170] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[ 149.642800] Trying to free nonexistent resource <0000000050100000-0000000
[ 149.734456] iommu: Removing device 0000:03:00.0 from group 65

#1
Posted 11/24/2017 09:47 AM   
Hi danieel,

How can we reproduce your problem? Does this only happen with an NVMe PCIe SSD? How about other kinds of SSD?
In your video capture process, is there any further application stage afterwards - video convert, encode, preview, etc.?

#2
Posted 11/27/2017 03:27 AM   
Hi WayneWWW, this also happens with an AHCI M.2 SSD (Samsung XP941):

[  275.472244] ata1.00: exception Emask 0x20 SAct 0x18 SErr 0x0 action 0x6 frozen
[ 275.479465] ata1.00: irq_stat 0x20000000, host bus error
[ 275.484781] ata1.00: failed command: WRITE FPDMA QUEUED
[ 275.490015] ata1.00: cmd 61/00:18:00:88:2c/40:00:1d:00:00/40 tag 3 ncq 8388608 out
res 40/00:20:00:c8:2c/00:00:1d:00:00/40 Emask 0x20 (host bus error)
[ 275.505649] ata1.00: status: { DRDY }
[ 275.509313] ata1.00: failed command: WRITE FPDMA QUEUED
[ 275.514539] ata1.00: cmd 61/10:20:00:c8:2c/1a:00:1d:00:00/40 tag 4 ncq 3416064 out
res 40/00:20:00:c8:2c/00:00:1d:00:00/40 Emask 0x20 (host bus error)
[ 275.530164] ata1.00: status: { DRDY }
[ 275.533832] ata1: hard resetting link
[ 275.864247] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 275.888227] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[ 275.894487] ata1.00: revalidation failed (errno=-5)
[ 280.868224] ata1: hard resetting link
[ 281.196247] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 281.220224] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[ 281.226484] ata1.00: revalidation failed (errno=-5)
[ 281.231368] ata1: limiting SATA link speed to 3.0 Gbps
[ 286.200242] ata1: hard resetting link
[ 286.532255] ata1: SATA link down (SStatus 0 SControl 320)
[ 286.537678] ata1.00: disabled
[ 286.541390] sd 0:0:0:0: [sda] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[ 286.549941] sd 0:0:0:0: [sda] tag#3 Sense Key : 0x5 [current] [descriptor]
[ 286.556916] sd 0:0:0:0: [sda] tag#3 ASC=0x21 ASCQ=0x4
[ 286.562074] sd 0:0:0:0: [sda] tag#3 CDB: opcode=0x2a 2a 00 1d 2c 88 00 00 40 00 00
[ 286.569646] blk_update_request: I/O error, dev sda, sector 489457664
[ 286.576047] sd 0:0:0:0: rejecting I/O to offline device
[ 286.581267] sd 0:0:0:0: [sda] killing request


The 950 Pro and 960 Pro NVMe SSDs show this:
[  766.753809] nvme 0000:03:00.0: Failed status: 3, reset controller


The status value corresponds to the NVMe CSTS (controller status) register, with bits RDY (0x01) and CFS (0x02, Controller Fatal Status) set. The driver in the TX1 (3.10 kernel) does not check this bit at all (its NVMe driver is structured differently).
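
For illustration, a minimal user-space sketch of how that status value decodes (the bit positions are from the NVMe specification; the 0x3 value is just the number from the "Failed status: 3" log message):

#include <stdint.h>
#include <stdio.h>

/* NVMe CSTS bit positions per the NVMe specification */
#define NVME_CSTS_RDY  (1u << 0)   /* controller ready */
#define NVME_CSTS_CFS  (1u << 1)   /* controller fatal status */

int main(void)
{
    unsigned int csts = 0x3;  /* value reported by "Failed status: 3" */

    printf("CSTS=0x%x RDY=%u CFS=%u\n",
           csts, !!(csts & NVME_CSTS_RDY), !!(csts & NVME_CSTS_CFS));
    return 0;
}

So "status 3" means the controller still reports ready but has flagged a fatal internal error.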

We have also applied this NVMe patch: https://lkml.org/lkml/2017/11/22/219 (as it is present in the 4.4.102 kernel), but it did not resolve the issue.

Reproducibility is hard - the issue does not show up when our PCIe camera is not running / capturing data through V4L2 (e.g. with iperf on the 10GbE network card plus writing to the NVMe). Also, when we limit our frame rate to 30 fps instead of the full 60 fps, the crash only happens after more than 10 minutes. There are fewer interrupts (one per frame) and less data (but that flows in the opposite direction to the files written to the SSD, so it should not starve the device).

There is no encoding happening. The chain is: camera -> PCIe/V4L2 -> application -> CUDA -> OpenGL, which gets our data onto the screen, and the raw video buffers from V4L2 are written with O_DIRECT to ext4 on the NVMe drive. But the crash also happens when this viewer app is running and we run an independent dd if=/dev/zero of=/mnt/nvme/test bs=1M process, so the issue is not with the data buffers being shared.
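
For reference, a minimal sketch of the kind of independent O_DIRECT write load we mean, assuming 4 KiB alignment is sufficient for this device and reusing the /mnt/nvme/test path from the dd example (illustrative only, not our recorder code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;              /* 1 MiB per write, like the dd test */
    void *buf = NULL;

    if (posix_memalign(&buf, 4096, len))     /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0, len);

    int fd = open("/mnt/nvme/test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    for (int i = 0; i < 256; i++)            /* push 256 MiB of zeroes to the SSD */
        if (write(fd, buf, len) != (ssize_t)len)
            break;

    close(fd);
    free(buf);
    return 0;
}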

I suspect this may be interrupt related; we are seeing these errors - and the mc_status interrupt 55 ends up getting disabled, by the way:

[  145.320051] handlers:
[ 145.347133] [<ffffffc0009679d8>] tegra_mcerr_hard_irq threaded [<ffffffc000967a20>] tegra_mcerr_thread
[ 145.465523] Disabling IRQ #55
[ 145.494185] (255) csr_afir: EMEM address decode error
[ 145.554470] status = 0x2032700e; addr = 0x3ffffffc0


The IOMMU unhandled context faults are always related to NVMe mappings. Disabling the IOMMU for PCIe (removing the AFI sub-node) does not help; we still get the same "status 3" messages and EMEM address decode errors.

A similar traffic-related issue (as yet unsolved) can be seen here: https://forum.rocketboards.org/t/altera-pcie-driver-issue-with-ssd-devices/545/4

#3
Posted 11/27/2017 10:04 AM   
Hi danieel,

My comment,

- The TX1 node with the 3.10 kernel is OK. You can also try running the TX1 with the R28.1 BSP to see if the issue exists there. This way you can tell whether it is related to the BSP/kernel software only, or could also be platform relevant.
- The similar issue under
https://forum.rocketboards.org/t/altera-pcie-driver-issue-with-ssd-devices/545/4
is on kernel 4.1 (closer to kernel 4.4 than to 3.10), which points toward a kernel-related issue, but this is yet to be confirmed.
- "The status shows the NVMe CSTS (controller status register), with bits RDY (0x01) and CFS (0x02) - controller fatal status set. The driver in the TX1 (3.10 kernel) does not check/work with this bit at all (and NVMe driver is structured differently).
=> a quick experiments,
for tx1/k3.10, you could add the checking code to simply dump the bits status but let normal op proceed. Just to confirm this is running at 60fps operation and everything is normal.
- "when we limit our framerate to 30fps and not using full 60fps, the crash happens after more than 10 minutes.
=> is this very consistent? Meaning if you repeat a few times, the behavior is similar?
- It seems the issue is related to system load or an interrupt that triggers the exception for some reason.

#4
Posted 11/27/2017 07:41 PM   
Hi danieel,

Let's narrow down the case in which the error happens. Please correct me if my understanding is wrong.

Your use case is a PCIe camera with two pipelines: one to the display and one to the SSD (NVMe or AHCI M.2), at 60 fps.


Can this be reproduced when only the SSD pipeline is launched?

How about the 10GbE network card? Is it also needed to reproduce the error?

#5
Posted 11/28/2017 02:46 AM   
Have you tried disabling SMMU for PCIe? If not, it is worth giving it a try.

#6
Posted 11/28/2017 02:56 AM   
On a TX1 with the R28.1 system and the 4.4.38 NVIDIA kernel, we got this SMMU fault:
[  612.030983] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031015] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031020] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031029] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031359] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031371] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031375] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031380] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031401] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031405] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031409] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031413] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031430] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031434] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031438] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031442] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031455] mc-err: Too many MC errors; throttling prints


Does that "EMEM decode error on PDE or PTE entry" mean that the peripheral is accessing the address in question, or that the SMMU itself was trying to resolve a multi-level translation table?

#7
Posted 11/28/2017 08:33 PM   
Hi danieel,

Please try the following patch to disable SMMU for PCIe.
diff --git a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
index 5c6536b968ab..da6eee63670e 100644
--- a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+++ b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
@@ -186,7 +186,6 @@
 			<&tegra_adsp_audio TEGRA_SID_APE>,
 			<&{/sound} TEGRA_SID_APE>,
 			<&{/sound_ref} TEGRA_SID_APE>,
-			<&{/pcie-controller@10003000} TEGRA_SID_AFI>,
 			<&{/ahci-sata@3507000} TEGRA_SID_SATA2>,
 			<&{/aon@c160000} TEGRA_SID_AON>,
 			<&{/rtcpu@b000000} TEGRA_SID_RCE>,
@@ -1509,8 +1508,6 @@
 		interrupt-map-mask = <0 0 0 0>;
 		interrupt-map = <0 0 0 0 &intc 0 72 0x04>;// check this

-		#stream-id-cells = <1>;
-
 		bus-range = <0x00 0xff>;
 		#address-cells = <3>;
 		#size-cells = <2>;

#8
Posted 11/29/2017 02:49 AM   
[quote=""]On TX1 with R28.1 system and 4.4.38 nvidia kernel, we got this SMMU fault: [code][ 612.030983] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2 [ 612.031015] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 612.031020] mc-err: status = 0x6000000e; addr = 0x00000000 [ 612.031029] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 612.031359] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2 [ 612.031371] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 612.031375] mc-err: status = 0x6000000e; addr = 0x00000000 [ 612.031380] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 612.031401] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2 [ 612.031405] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 612.031409] mc-err: status = 0x6000000e; addr = 0x00000000 [ 612.031413] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 612.031430] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2 [ 612.031434] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 612.031438] mc-err: status = 0x6000000e; addr = 0x00000000 [ 612.031442] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 612.031455] mc-err: Too many MC errors; throttling prints [/code] Does tha [b]EMEM decode error on PDE or PTE[/b] entry mean that the peripheral is accessing the address in question, or the SMMU itself was trying to resolve a multi-level translation table? [/quote] It means that there is an access to 0x0000000000000000 address by the respective IP. In this case PCIe.
said:On TX1 with R28.1 system and 4.4.38 nvidia kernel, we got this SMMU fault:
[  612.030983] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031015] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031020] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031029] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031359] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031371] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031375] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031380] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031401] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031405] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031409] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031413] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031430] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 612.031434] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 612.031438] mc-err: status = 0x6000000e; addr = 0x00000000
[ 612.031442] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 612.031455] mc-err: Too many MC errors; throttling prints


Does that EMEM decode error on PDE or PTE entry mean that the peripheral is accessing the address in question, or the SMMU itself was trying to resolve a multi-level translation table?


It means that there is an access to address 0x0000000000000000 by the respective IP, in this case PCIe.

#9
Posted 11/29/2017 03:08 AM   
Hi WayneWWW, with the IOMMU disabled on the TX2 according to your diff, we get this fault:

[  243.762393] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 243.762504] nvme 0000:03:00.0: Cancelling I/O 258 QID 2
[ 243.762514] nvme 0000:03:00.0: Cancelling I/O 259 QID 2
[ 244.762434] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 244.762476] nvme 0000:03:00.0: Cancelling I/O 1 QID 0
[ 244.762572] nvme 0000:03:00.0: Device failed to resume
[ 244.762643] blk_update_request: I/O error, dev nvme0n1, sector 173578240
[ 244.762651] blk_update_request: I/O error, dev nvme0n1, sector 173576192
[ 244.763213] Aborting journal on device nvme0n1-8.
[ 244.763222] Buffer I/O error on dev nvme0n1, logical block 62423040, lost sync page write
[ 244.763226] JBD2: Error -5 detected when updating journal superblock for nvme0n1-8.
[ 244.781194] Trying to vfree() nonexistent vm area (ffffff801260c000)
[ 244.781205] ------------[ cut here ]------------
[ 244.781208] WARNING: at ffffffc0001b0560 [verbose debug info unavailable]
[ 244.781210] Modules linked in: nvme imx183(O) sdma(O) snd_soc_spdif_tx snd_soc_core ixgbe snd_compress bcmdhd snd_pcm snd_timer snd soundcore ahci_tegra libahci_platform libahci bluedroid_pm [last unloaded: nvme]

[ 244.781237] CPU: 4 PID: 2441 Comm: nvme0 Tainted: G O 4.4.38+ #12
[ 244.781240] Hardware name: quill (DT)
[ 244.781243] task: ffffffc1e6697080 ti: ffffffc1e67f8000 task.ti: ffffffc1e67f8000
[ 244.781249] PC is at __vunmap+0xe0/0xe8
[ 244.781252] LR is at __vunmap+0xe0/0xe8
[ 244.781255] pc : [<ffffffc0001b0560>] lr : [<ffffffc0001b0560>] pstate: 60000045
[ 244.781257] sp : ffffffc1e67fbca0
[ 244.781259] x29: ffffffc1e67fbca0 x28: 0000000000000000
[ 244.781263] x27: 0000000000000000 x26: 0000000000000000
[ 244.781266] x25: 0000000000000000 x24: 0000000000000000
[ 244.781269] x23: ffffffbffc0a1198 x22: ffffffc06f83a000
[ 244.781272] x21: 0000000000000000 x20: 0000000000000000
[ 244.781274] x19: ffffff801260c000 x18: 0000000000000000
[ 244.781277] x17: 0000000000000007 x16: 0000000000000001
[ 244.781280] x15: 0000000000000010 x14: ffffffc081449297
[ 244.781282] x13: ffffffc0014492a5 x12: 0000000000000006
[ 244.781285] x11: 0000000000035c7c x10: 0000000005f5e0ff
[ 244.781288] x9 : ffffffc1e67fba20 x8 : 0000000000035c7d
[ 244.781291] x7 : 6666666666282061 x6 : ffffffc0014492df
[ 244.781294] x5 : 0000000000000000 x4 : 0000000000000000
[ 244.781296] x3 : 0000000000000000 x2 : ffffffc1e67f8000
[ 244.781299] x1 : 0000000000000000 x0 : 0000000000000038

[ 244.781303] ---[ end trace a4f9aef1c36b3e46 ]---
[ 244.781306] Call trace:
[ 244.803594] [<ffffffc0001b0560>] __vunmap+0xe0/0xe8
[ 244.803599] [<ffffffc0001b0690>] vunmap+0x28/0x38
[ 244.803604] [<ffffffc00009c6d4>] __iounmap+0x34/0x40
[ 244.803615] [<ffffffbffc0a120c>] nvme_dev_unmap.isra.26+0x1c/0x38 [nvme]
[ 244.803623] [<ffffffbffc0a3028>] nvme_remove+0xd0/0x118 [nvme]
[ 244.803628] [<ffffffc000417c5c>] pci_device_remove+0x3c/0x108
[ 244.803633] [<ffffffc00062ffc8>] __device_release_driver+0x80/0xf0
[ 244.803636] [<ffffffc00063005c>] device_release_driver+0x24/0x38
[ 244.803640] [<ffffffc000410ba8>] pci_stop_bus_device+0x98/0xa8
[ 244.803643] [<ffffffc000410ce4>] pci_stop_and_remove_bus_device_locked+0x1c/0x38
[ 244.803651] [<ffffffbffc0a11bc>] nvme_remove_dead_ctrl+0x24/0x58 [nvme]
[ 244.803656] [<ffffffc0000c085c>] kthread+0xdc/0xf0
[ 244.803659] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[ 244.804516] Trying to free nonexistent resource <0000000050100000-0000000050103fff>


The result here is that the TX2 PCIe host stops responding to reads. We are not sure what condition in the Samsung NVMe causes the CFS bit to be set, but we put an LED indicator on our FPGA to show whether a completion arrives for a read it initiated, and we now clearly see that the completion does not arrive when the crash occurs.

#10
Posted 11/29/2017 01:36 PM   
[quote="vidyas"][quote="danieel"] Does that [b]EMEM decode error on PDE or PTE[/b] entry mean that the peripheral is accessing the address in question, or the SMMU itself was trying to resolve a multi-level translation table? [/quote] It means that there is an access to 0x0000000000000000 address by the respective IP. In this case PCIe.[/quote] I would replace *PCIe* in that sentence with SMMU. Since the drivers/platform/tegra/mc/mcerr.c lists this fault as internal to the SMMU: [code]// /* * SMMU related faults. */ MC_ERR(MC_INT_INVALID_SMMU_PAGE, "SMMU address translation fault", E_SMMU, MC_ERR_STATUS, MC_ERR_ADR), MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_EMEM, "EMEM decode error on PDE or PTE entry", E_SMMU, MC_ERR_STATUS, MC_ERR_ADR), MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_SECERR_SEC, "secure SMMU address translation fault", E_SMMU, MC_ERR_SEC_STATUS, MC_ERR_SEC_ADR), MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_VPR, "VPR SMMU address translation fault", E_SMMU, MC_ERR_VPR_STATUS, MC_ERR_VPR_ADR), MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_VPR | MC_INT_DECERR_EMEM, "EMEM decode error on PDE or PTE entry on VPR context", E_SMMU, MC_ERR_VPR_STATUS, MC_ERR_VPR_ADR), [/code] So the result here is: TX1 SMMU crashes while it is trying to figure out if some transaction shall pass or not.
vidyas said:
danieel said:
Does that EMEM decode error on PDE or PTE entry mean that the peripheral is accessing the address in question, or the SMMU itself was trying to resolve a multi-level translation table?


It means that there is an access to 0x0000000000000000 address by the respective IP. In this case PCIe.


I would replace *PCIe* in that sentence with SMMU, since drivers/platform/tegra/mc/mcerr.c lists this fault as internal to the SMMU:

/*
 * SMMU related faults.
 */
MC_ERR(MC_INT_INVALID_SMMU_PAGE,
       "SMMU address translation fault",
       E_SMMU, MC_ERR_STATUS, MC_ERR_ADR),
MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_EMEM,
       "EMEM decode error on PDE or PTE entry",
       E_SMMU, MC_ERR_STATUS, MC_ERR_ADR),
MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_SECERR_SEC,
       "secure SMMU address translation fault",
       E_SMMU, MC_ERR_SEC_STATUS, MC_ERR_SEC_ADR),
MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_VPR,
       "VPR SMMU address translation fault",
       E_SMMU, MC_ERR_VPR_STATUS, MC_ERR_VPR_ADR),
MC_ERR(MC_INT_INVALID_SMMU_PAGE | MC_INT_DECERR_VPR |
       MC_INT_DECERR_EMEM,
       "EMEM decode error on PDE or PTE entry on VPR context",
       E_SMMU, MC_ERR_VPR_STATUS, MC_ERR_VPR_ADR),


So the result here is: the TX1 SMMU faults while it is trying to figure out whether some transaction should pass or not.

#11
Posted 11/29/2017 01:45 PM   
A TX1 (4.4.38, R28.1) without AFI specified in the IOMMU portion of the device tree crashes with this error:

[  785.811564] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 785.811674] nvme 0000:03:00.0: Cancelling I/O 863 QID 2
[ 785.811695] nvme 0000:03:00.0: Cancelling I/O 865 QID 2
[ 785.915656] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915666] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915670] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915675] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915693] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915697] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915701] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915705] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915718] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915723] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915726] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915730] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915744] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915748] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915751] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915755] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915765] mc-err: Too many MC errors; throttling prints
[ 786.799547] nvme 0000:03:00.0: Failed status: 3, reset controller


The address 0x80de3e00 does not belong to our V4L2 driver (in any case, judging by the error message it appears to be internal to the SMMU).

#12
Posted 11/29/2017 08:01 PM   
Turning the SMMU back on on the TX1, I found some IOVA mappings via debugfs. The file /sys/kernel/debug/70019000.iommu/as010/iovainfo shows the regions that are supposedly mapped.

The issue here, however, is that the regions shown there are incomplete. That is, I parsed through the long dmesg output from our driver and identified all pages that the FPGA accesses:

Frame 1 maps 2882 unique pages (11528 KiB) for 2160 lines
:
Frame 16 maps 2882 unique pages (11528 KiB) for 2160 lines
Frame buffers request mapping of 46112 pages from which are 46112 unique (184448 KiB)
Program for DMA processor takes 163 unique pages (652 KiB)


However, when I try to match those pages against iovainfo, only a portion of them actually matches:

IOMMU maps 23272 unique pages (93088 KiB)
Orphaned 89 iommu pages
Orphaned 23041 pages in 16 frames (1395,1579,1140,1476,1466,1674,1475,1447,1314,1520,1464,1494,1246,1502,1432,1417)
Orphaned 51 pages in DMA processor program


By manually checking the mappings of some buffers I can see that they are not in the IOMMU list. Yet the program still runs and causes no violations, despite the listing of mappings being incomplete. The DMA program area and frame data buffer translations are valid from the start of a V4L2 client until the eventual crash.
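
For completeness, this is roughly how such a cross-check can be done for a single IOVA - assuming each iovainfo line carries a start and end address in hex, which is our reading of this debugfs file rather than a documented format:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *path = "/sys/kernel/debug/70019000.iommu/as010/iovainfo";
    unsigned long long iova = argc > 1 ? strtoull(argv[1], NULL, 16) : 0;
    unsigned long long start, end;
    char line[256];
    FILE *f = fopen(path, "r");

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        /* assumed line format: "<start> <end>" in hex */
        if (sscanf(line, "%llx %llx", &start, &end) == 2 &&
            iova >= start && iova < end) {
            printf("0x%llx mapped in [0x%llx, 0x%llx)\n", iova, start, end);
            fclose(f);
            return 0;
        }
    }
    printf("0x%llx not found in iovainfo\n", iova);
    fclose(f);
    return 1;
}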

#13
Posted 11/29/2017 08:14 PM   
[quote=""]TX1 (4.4.38, R28.1) without AFI specified in IOMMU portion of devicetree crashes on this error: [code] [ 785.811564] nvme 0000:03:00.0: Failed status: 3, reset controller [ 785.811674] nvme 0000:03:00.0: Cancelling I/O 863 QID 2 [ 785.811695] nvme 0000:03:00.0: Cancelling I/O 865 QID 2 [ 785.915656] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2 [ 785.915666] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 785.915670] mc-err: status = 0x6000000e; addr = 0x80de3e00 [ 785.915675] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 785.915693] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2 [ 785.915697] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 785.915701] mc-err: status = 0x6000000e; addr = 0x80de3e00 [ 785.915705] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 785.915718] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2 [ 785.915723] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 785.915726] mc-err: status = 0x6000000e; addr = 0x80de3e00 [ 785.915730] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 785.915744] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2 [ 785.915748] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry [ 785.915751] mc-err: status = 0x6000000e; addr = 0x80de3e00 [ 785.915755] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s [ 785.915765] mc-err: Too many MC errors; throttling prints [ 786.799547] nvme 0000:03:00.0: Failed status: 3, reset controller [/code] The address 0x80de3e00 does not belong to our V4L2 driver (anyway by the nature of the error message it seems to be internal to SMMU) [/quote] In this, how did you confirm that SMMU is disabled for PCIe? Can you please paste the output of 'ls /sys/kernel/debug/12000000.iommu/masters/' ?
said:TX1 (4.4.38, R28.1) without AFI specified in IOMMU portion of devicetree crashes on this error:

[  785.811564] nvme 0000:03:00.0: Failed status: 3, reset controller
[ 785.811674] nvme 0000:03:00.0: Cancelling I/O 863 QID 2
[ 785.811695] nvme 0000:03:00.0: Cancelling I/O 865 QID 2
[ 785.915656] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915666] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915670] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915675] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915693] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915697] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915701] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915705] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915718] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915723] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915726] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915730] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915744] smmu_dump_pagetable(): fault_address=0x0000000080de3e00 pa=0x0000000000000e00 bytes=1000 #pte=764 in L2
[ 785.915748] mc-err: (0) csr_afir: EMEM decode error on PDE or PTE entry
[ 785.915751] mc-err: status = 0x6000000e; addr = 0x80de3e00
[ 785.915755] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 785.915765] mc-err: Too many MC errors; throttling prints
[ 786.799547] nvme 0000:03:00.0: Failed status: 3, reset controller


The address 0x80de3e00 does not belong to our V4L2 driver (anyway by the nature of the error message it seems to be internal to SMMU)

In this, how did you confirm that SMMU is disabled for PCIe?
Can you please paste the output of 'ls /sys/kernel/debug/12000000.iommu/masters/' ?

#14
Posted 12/04/2017 11:24 AM   
[quote="vidyas"]In this, how did you confirm that SMMU is disabled for PCIe? Can you please paste the output of 'ls /sys/kernel/debug/12000000.iommu/masters/' ? [/quote] This was on TX1 (so no 12000000.iommu). When doing this comment: [i]tx1/Linux_for_Tegra_R28.1/sources/hardware/nvidia/soc/t210/kernel-dts/tegra210-soc/tegra210-soc-base.dtsi[/i] [code] domains = <&ppcs_as TEGRA_SWGROUP_CELLS5(PPCS, PPCS1, PPCS2, SE, SE1) &gpu_as TEGRA_SWGROUP_CELLS(GPUB) &ape_as TEGRA_SWGROUP_CELLS(APE) &dc_as TEGRA_SWGROUP_CELLS2(DC, DC12) &dc_as TEGRA_SWGROUP_CELLS(DCB) /* &common_as TEGRA_SWGROUP_CELLS(AFI) */ &common_as TEGRA_SWGROUP_CELLS(SDMMC1A) &common_as TEGRA_SWGROUP_CELLS(SDMMC2A) &common_as TEGRA_SWGROUP_CELLS(SDMMC3A) &common_as TEGRA_SWGROUP_CELLS(SDMMC4A) &common_as TEGRA_SWGROUP_CELLS(AVPC) &common_as TEGRA_SWGROUP_CELLS(SMMU_TEST) &common_as 0xFFFFFFFF 0xFFFFFFFF>; [/code] Then the difference in 'ls /sys/kernel/debug/70019000.iommu/masters/' is unexpectedly: [code] # diff -u ls-tx1-with-iommu.txt ls-tx1-without-iommu.txt --- ls-tx1-with-iommu.txt 2017-12-04 11:40:52.392434254 +0000 +++ ls-tx1-without-iommu.txt 2017-12-04 11:44:34.291117256 +0000 @@ -37,7 +37,6 @@ sdhci-tegra.3 serial8250 smmu_test -snd-soc-dummy sound tegra-carveouts tegradc.1 [/code] I suspect our TX1/noiommu test results are false then. [b]How to disable TX1 pcie from iommu?[/b] When we did your comment on TX2 device tree, in dmesg we saw a missing: [code][ 0.240076] iommu: Adding device 10003000.pcie-controller to group 52 [/code] and also the iommu groups were reduced from 0..66 to 0..55 due to not appling iommu rules on any of the device in our PCI tree.
vidyas said:In this, how did you confirm that SMMU is disabled for PCIe?
Can you please paste the output of 'ls /sys/kernel/debug/12000000.iommu/masters/' ?


This was on the TX1 (so there is no 12000000.iommu). We commented out the AFI entry like this:

tx1/Linux_for_Tegra_R28.1/sources/hardware/nvidia/soc/t210/kernel-dts/tegra210-soc/tegra210-soc-base.dtsi
domains = <&ppcs_as TEGRA_SWGROUP_CELLS5(PPCS, PPCS1, PPCS2, SE, SE1)
           &gpu_as TEGRA_SWGROUP_CELLS(GPUB)
           &ape_as TEGRA_SWGROUP_CELLS(APE)
           &dc_as TEGRA_SWGROUP_CELLS2(DC, DC12)
           &dc_as TEGRA_SWGROUP_CELLS(DCB)
           /*
           &common_as TEGRA_SWGROUP_CELLS(AFI)
           */
           &common_as TEGRA_SWGROUP_CELLS(SDMMC1A)
           &common_as TEGRA_SWGROUP_CELLS(SDMMC2A)
           &common_as TEGRA_SWGROUP_CELLS(SDMMC3A)
           &common_as TEGRA_SWGROUP_CELLS(SDMMC4A)
           &common_as TEGRA_SWGROUP_CELLS(AVPC)
           &common_as TEGRA_SWGROUP_CELLS(SMMU_TEST)
           &common_as 0xFFFFFFFF 0xFFFFFFFF>;


Then, unexpectedly, the only difference in 'ls /sys/kernel/debug/70019000.iommu/masters/' is:
# diff -u ls-tx1-with-iommu.txt ls-tx1-without-iommu.txt
--- ls-tx1-with-iommu.txt 2017-12-04 11:40:52.392434254 +0000
+++ ls-tx1-without-iommu.txt 2017-12-04 11:44:34.291117256 +0000
@@ -37,7 +37,6 @@
sdhci-tegra.3
serial8250
smmu_test
-snd-soc-dummy
sound
tegra-carveouts
tegradc.1


I suspect our TX1 no-IOMMU test results are invalid then.
How do we disable the IOMMU for PCIe on the TX1?

When we applied your change to the TX2 device tree, this line went missing from dmesg:

[    0.240076] iommu: Adding device 10003000.pcie-controller to group 52

and the IOMMU groups were also reduced from 0..66 to 0..55, because IOMMU rules were no longer applied to any of the devices in our PCI tree.

#15
Posted 12/04/2017 12:13 PM   