Jetson TX2: ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) hangs

Hi,

I am trying to use a VFIO device on the Jetson TX2, and I wrote some code referring to: https://www.kernel.org/doc/Documentation/vfio.txt

I debugged the code:

328 if ((dev->contfd = open("/dev/vfio/vfio", O_RDWR)) < 0)
(gdb)
331 if (ioctl(dev->contfd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
(gdb)
334 if (ioctl(dev->contfd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) == 0)
(gdb)
337 if ((dev->groupfd = open(path, O_RDWR)) < 0) // path = "/dev/vfio/55"
(gdb)
340 if (ioctl(dev->groupfd, VFIO_GROUP_GET_STATUS, &group_status) < 0)
(gdb)
343 if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
(gdb)
346 if (ioctl(dev->groupfd, VFIO_GROUP_SET_CONTAINER, &dev->contfd) < 0)
(gdb)
349 if (ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
(gdb)

At line 349, the ioctl just hangs there: no error, no return.
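For reference, the corresponding part of my code is essentially the vfio.txt sequence (a simplified sketch; dev is my own device context struct, and error handling is trimmed):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct vfio_group_status group_status = { .argsz = sizeof(group_status) };

/* Create a container and check API version / type1 IOMMU support */
dev->contfd = open("/dev/vfio/vfio", O_RDWR);
if (ioctl(dev->contfd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
	return -1;
if (ioctl(dev->contfd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) == 0)
	return -1;

/* Open the group, make sure it is viable, add it to the container */
dev->groupfd = open("/dev/vfio/55", O_RDWR);
ioctl(dev->groupfd, VFIO_GROUP_GET_STATUS, &group_status);
if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
	return -1;
ioctl(dev->groupfd, VFIO_GROUP_SET_CONTAINER, &dev->contfd);

/* Enable the type1 IOMMU model -- this is the call that hangs */
ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);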

And when I now check the VFIO device node, I get: /dev/vfio/55: Device or resource busy

How can I resolve this issue?

Thanks.
yafei

More info about the hang:

After line 349, I continued tracking the code with gdb:

349 if (ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
(gdb) s
ioctl () at ../sysdeps/unix/sysv/linux/aarch64/ioctl.S:23
23 ../sysdeps/unix/sysv/linux/aarch64/ioctl.S: No such file or directory.
(gdb) n
24 in ../sysdeps/unix/sysv/linux/aarch64/ioctl.S
(gdb) n
25 in ../sysdeps/unix/sysv/linux/aarch64/ioctl.S
(gdb) n

The "25 in ../sysdeps/unix/sysv/linux/aarch64/ioctl.S" line is where it finally hangs.
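While it hangs, the kernel side of the blocked ioctl can also be inspected. A small sketch of my debugging helper (assuming /proc/<pid>/stack is available, i.e. CONFIG_STACKTRACE, and run as root):

#include <stdio.h>
#include <sys/types.h>

/* Print the kernel stack of the task stuck in the ioctl */
static void dump_kernel_stack(pid_t pid)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stack", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}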

Hi Yafei,

Can you explain your use case and all the steps you have taken before getting stuck here?

regards
Bibek

Hi bbasu,

I am writing a userspace driver to communicate with my PCI device via VFIO.

My use case is as below:

  1. Ran make menuconfig and enabled the VFIO and IOMMU (ARM SMMU and SMMUv3) options in the Device Drivers section
  2. Wrote the VFIO setup code following the usage example in the kernel's Documentation/vfio.txt
  3. A kernel Oops occurred when executing "ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)" (see vfio.txt line 196)

From dmesg I found that my PCI device was already attached to an IOMMU domain during kernel initialization, so it cannot be attached to an SMMU domain again when VFIO_SET_IOMMU is sent via ioctl. The SMMU module complains that the PCI device is already attached to an IOMMU domain.

The dmesg output shows the Oops occurred during kfree() in the SMMU driver:

[ 140.924584] case VFIO_SET_IOMMU.
[ 140.929911] vfio_ioctl_set_iommu start.
[ 140.935871] vfio-pci 0000:01:00.0: already attached to IOMMU domain

[ 140.942167] ------------[ cut here ]------------
[ 140.942171] Kernel BUG at ffffffc0001bb8b8 [verbose debug info unavailable]
[ 140.949122] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 140.954597] Modules linked in: fuse bnep bluetooth bcmdhd pci_tegra bluedroid_pm
[ 140.962060] CPU: 0 PID: 1712 Comm: unvme_sim_test Not tainted 4.4.15 #5
[ 140.968662] Hardware name: quill (DT)
[ 140.972318] task: ffffffc1d1533200 ti: ffffffc1d1400000 task.ti: ffffffc1d1400000
[ 140.979796] PC is at kfree+0x248/0x290
[ 140.983545] LR is at arm_smmu_domain_free+0x120/0x180
[ 140.988587] pc : [] lr : [] pstate: 40000045

Do I need to enable "Kernel-based Virtual Machine (KVM) support" under "Tegra Virtualization Support"?

Tegra PCI is already behind the Tegra IOMMU, and therefore you get that print.
Since you are not using a VM, I am not sure why you need VFIO.
Documentation says:

Why do we want that? Virtual machines often make use of direct device
access (“device assignment”) when configured for the highest possible
I/O performance. From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers[3].

We have never tested VFIO locally. I will give it a try.
By the way, is the domain free call also coming from your userspace code?
What if you check the smmu_domain pointer before trying to do any operation on it?

static void arm_smmu_domain_free(struct iommu_domain *domain)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

	/*
	 * Free the domain resources. We assume that all devices have
	 * already been detached.
	 */
	arm_smmu_destroy_domain_context(domain);
	arm_smmu_free_pgtables(smmu_domain);
	kfree(smmu_domain);
}
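Something like the below, just as a debug guard to narrow it down (a sketch, not a real fix; note that to_smmu_domain() is a container_of() cast, so it cannot itself return NULL for a non-NULL domain):

static void arm_smmu_domain_free(struct iommu_domain *domain)
{
	struct arm_smmu_domain *smmu_domain;

	if (WARN_ON(!domain))
		return;

	smmu_domain = to_smmu_domain(domain);
	if (WARN_ON(!smmu_domain->smmu))	/* domain never set up? */
		return;

	arm_smmu_destroy_domain_context(domain);
	arm_smmu_free_pgtables(smmu_domain);
	kfree(smmu_domain);
}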

Cheers
Bibek

I use a userspace PCIe device driver to communicate with my PCIe storage device while bypassing the OS kernel; this removes a lot of SW overhead in the kernel, especially in the block layer.

I use the VFIO container in order to get higher I/O performance and lower overhead. The VFIO device API includes ioctls for describing the device, the I/O regions and their read/write/mmap offsets on the device descriptor, as well as mechanisms for describing and registering interrupt notifications. I use these VFIO APIs (read/write/mmap) to communicate with my PCIe device.
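For example, once the device fd has been obtained, the access path looks roughly like this (a sketch following the vfio.txt example; device is the fd returned by VFIO_GROUP_GET_DEVICE_FD):

#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Query BAR0 of the controller, then map it for direct register access */
struct vfio_region_info reg = { .argsz = sizeof(reg) };

reg.index = VFIO_PCI_BAR0_REGION_INDEX;
ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

void *bar0 = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
		  MAP_SHARED, device, reg.offset);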

The userspace code works well on the x86 architecture, where the preconditions are: 1) the CPU supports VT-d (Virtualization Technology for Directed I/O), and 2) VFIO and IOMMU are enabled in the kernel config. Now I am trying to make the userspace code support the ARM architecture, and I ran into the above issue.

In my userspace code I do not call any domain free function directly. In the VFIO create phase I need to send a number of ioctls to VFIO, and during the VFIO_SET_IOMMU ioctl the kernel Oops (BUG 0) occurred. The detailed function call trace is as below:

VFIO create // in my userspace code
dev->contfd = open("/dev/vfio/vfio", O_RDWR) // SUCCESS
ioctl(dev->contfd, VFIO_GET_API_VERSION) // SUCCESS
ioctl(dev->contfd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) // SUCCESS
dev->groupfd = open(path, O_RDWR) // SUCCESS
ioctl(dev->groupfd, VFIO_GROUP_GET_STATUS, &group_status) // SUCCESS
ioctl(dev->groupfd, VFIO_GROUP_SET_CONTAINER, &dev->contfd) // SUCCESS
ioctl(dev->contfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) // kernel Oops BUG 0 occurred, below is kernel functions call trace
----------------------------------// Above operations are in my userspace code ----------------------------------------------------

→ vfio_fops_unl_ioctl() // implements the VFIO_SET_IOMMU ioctl
→ vfio_ioctl_set_iommu()
→ __vfio_container_attach_groups()
→ vfio_iommu_type1_attach_group() // kzalloc() for the domain succeeds
→ iommu_attach_group()
→ iommu_group_do_attach_device()
→ arm_smmu_attach_dev() // in arm-smmu.c; the "already attached to IOMMU domain" error occurs in this function and it returns -EEXIST
→ iommu_domain_free() // in vfio_iommu_type1.c
→ arm_smmu_domain_free() // kfree() of the domain that was kzalloc'ed in vfio_iommu_type1_attach_group(); the kernel Oops BUG 0 occurs inside this kfree()
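In other words, the failing path in vfio_iommu_type1.c looks roughly like this (a paraphrase of the 4.4-era code, not verbatim): when the attach fails with -EEXIST, the freshly allocated domain is freed again, and that free is where the Oops fires:

domain->domain = iommu_domain_alloc(bus);	/* allocation succeeds */
...
ret = iommu_attach_group(domain->domain, group->iommu_group);
if (ret)
	goto out_domain;			/* -EEXIST lands here */
...
out_domain:
	iommu_domain_free(domain->domain);	/* Oops inside arm_smmu_domain_free()/kfree() */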

Right.
And the problem seems to be that vfio is passing the iommu_domain from its vfio_domain, while arm-smmu is trying to derive the pointer of its arm_smmu_domain from that iommu_domain, leading to a free on the wrong pointer:

struct vfio_domain {
	struct iommu_domain	*domain;
	struct list_head	next;
	struct list_head	group_list;
	int			prot;	/* IOMMU_CACHE */
	bool			fgsp;	/* Fine-grained super pages */
};

struct arm_smmu_domain {
	struct arm_smmu_device	*smmu;
	struct arm_smmu_cfg	cfg;
	spinlock_t		lock;

	dma_addr_t		inquired_iova;
	phys_addr_t		inquired_phys;

	struct iommu_domain	domain;
};
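For reference, to_smmu_domain() is just a container_of() cast, as in arm-smmu.c:

static struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
{
	return container_of(dom, struct arm_smmu_domain, domain);
}

So it blindly assumes the iommu_domain it is given is the one embedded in an arm_smmu_domain; if the domain actually lives somewhere else, the computed pointer is bogus and the later kfree() blows up.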

Yes, that's right; it looks like the kernel frees the wrong pointer and then the Oops occurs. I think the more important clue is that the device is already attached to an IOMMU domain; that is what triggers this Oops.

I got ARM VFIO test code from GitHub: https://github.com/virtualopensystems/vfio-host-test (a test case for VFIO_PLATFORM, currently based on the PL330 DMA controller; the effort on VFIO_PLATFORM has been partially funded by the SAVE FP7 project).
The document for the above test code is "Testing VFIO platform with the PL330 DMA Controller on ARM".
However, when I run this test code on the TX2, dmesg reports that all of the VFIO platform devices are already attached to an IOMMU domain, the same issue as with my userspace code.

So I think the device should not be attached to an IOMMU domain during kernel initialization; it should be attached to an SMMU domain by the VFIO create code (my code or the above test code).

Could you spend some time checking how the above code can run well on the TX2?

All of my work is now suspended by this issue. I really need Nvidia's help, thank you!

SMMU for PCIe can be disabled with the following patch:

diff --git a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
index 3fa383e..6541b9d 100644
--- a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+++ b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
@@ -1476,8 +1476,6 @@
 	interrupt-map-mask = <0 0 0 0>;
 	interrupt-map = <0 0 0 0 &intc 0 72 0x04>;// check this
-	iommus = <&smmu TEGRA_SID_AFI>;
 	bus-range = <0x00 0xff>;
 	#address-cells = <3>;
 	#size-cells = <2>;

I checked my r27.1.0_sources.tbz2, which I downloaded from the Nvidia website; the above patch may already be merged into this package.

vi source/hardware/nvidia/soc/t18x/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi

1406 #interrupt-cells = <1>;
1407 interrupt-map-mask = <0 0 0 0>;
1408 interrupt-map = <0 0 0 0 &intc 0 72 0x04>;// check this
1409
1410 #stream-id-cells = <1>;
1411
1412 bus-range = <0x00 0xff>;
1413 #address-cells = <3>;
1414 #size-cells = <2>;

In the above code I cannot find "iommus = <&smmu TEGRA_SID_AFI>;". Does this mean the SMMU for PCIe is already disabled?

For rel-27, remove PCIe from the SMMU node in the device tree:

diff --git a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
index 6da7a71..aa44204 100644
--- a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+++ b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
@@ -182,7 +182,6 @@
 	<&tegra_adsp_audio TEGRA_SID_APE>,
 	<&{/sound} TEGRA_SID_APE>,
 	<&{/sound_ref} TEGRA_SID_APE>,
-	<&{/pcie-controller@10003000} TEGRA_SID_AFI>,
 	<&{/ahci-sata@3507000}    TEGRA_SID_SATA2>,
 	<&{/aon@c160000}          TEGRA_SID_AON>,
 	<&{/rtcpu@b000000}        TEGRA_SID_RCE>,

I merged the above patch into my kernel source code, and found that there is then no iommu_group for the PCIe device in /sys/kernel/iommu_groups, so I cannot bind the PCIe device to VFIO.
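To double-check this I use a small helper that resolves the device's iommu_group link (a sketch; 0000:01:00.0 is my device's BDF):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char link[256];
	ssize_t n;

	/* The symlink exists only if the device sits behind an IOMMU */
	n = readlink("/sys/bus/pci/devices/0000:01:00.0/iommu_group",
		     link, sizeof(link) - 1);
	if (n < 0) {
		perror("no iommu_group");
		return 1;
	}
	link[n] = '\0';
	printf("iommu_group -> %s\n", link);
	return 0;
}

With the patch applied, the readlink() fails, which matches what I see in /sys/kernel/iommu_groups.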

Hi bbasu,

I will clarify my use case:
I use a userspace PCIe device (NVMe SSD) driver via VFIO. I need to unbind my PCIe device from its original driver (the NVMe driver) first, and then bind the device to the vfio-pci driver to create the VFIO group character device; once the binding succeeds, my userspace code can communicate with my PCIe device directly. However, after the TX2 boots, my PCIe device is attached to the SMMU domain by default, and my userspace code can neither attach it again nor detach it (by calling the kernel detach API).
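The unbind/bind step itself is just sysfs writes, roughly like this in C (a sketch; the vendor:device ID must already have been written to /sys/bus/pci/drivers/vfio-pci/new_id, and error handling is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void bind_to_vfio_pci(const char *bdf)	/* e.g. "0000:01:00.0" */
{
	char path[128];
	int fd;

	/* Detach the device from its current (nvme) driver */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/driver/unbind", bdf);
	fd = open(path, O_WRONLY);
	write(fd, bdf, strlen(bdf));
	close(fd);

	/* Attach it to vfio-pci, which creates /dev/vfio/<group> */
	fd = open("/sys/bus/pci/drivers/vfio-pci/bind", O_WRONLY);
	write(fd, bdf, strlen(bdf));
	close(fd);
}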
Regarding your last response: I merged this patch into my dtsi file, and my PCIe device is no longer attached to the SMMU domain during OS boot, but my userspace code cannot bind it to the vfio-pci driver. When I checked /sys/bus/pci/devices/0000:01:00.0/, there was no iommu_group in this directory. So I think "pcie-controller@10003000" cannot be removed from the "smmu: iommu@12000000" node; the PCIe device should not be attached to an SMMU domain by the board during boot, it should be attached to an SMMU domain by my userspace code via "ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); /* Enable the IOMMU model we want */".
My request: could you tell me where, or in which file, the PCIe device is attached during board boot? In the bootloader?
I look forward to your response, thank you!

Why don't you respond to me? All of my work is suspended by this issue, and I must fix it ASAP. Please check this issue, thanks! I appreciate it.

Can any Nvidia engineer give me some hints or suggestions? Thank you!

Hi yafei,

VFIO is not a supported feature in the current BSP, and there is no schedule for further investigation yet.

Thanks

As we have been working very hard on this issue for almost a month, please let me cry for 5 minutes… /(ㄒoㄒ)/~~