We are also getting “arm-smmu 12000000.iommu: Unhandled context fault: iova” errors when trying to access GPU memory above the 4 GB limit on the TX2 with the L4T 28.1 kernel. I suspect there may be other kernel bugs involving improper 64-bit addressing. The problem is easy to reproduce, for instance by running MemtestG80 (https://github.com/ihaque/memtestG80.git) on the TX2.
Allocating 4100 MB of RAM works fine, but 4200 MB fails with the errors below:
#./memtestG80 4100 1
Final error count after 1 iterations over 4100 MiB of GPU memory: 4294967181 errors:
[396634.006720] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x84209100, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=1dbd28003, pte=0
[396634.021507] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x87e8ce40, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=180276003, pte=0
[396634.036266] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x8bb43440, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=180258003, pte=0
[396634.051014] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x0f621780, fsynr=0x3, cb=19, sid=16(0x10 - GPU), pgd=faf57003, pud=faf57003, pmd=226781003, pte=0
[396634.065512] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x9334d880, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=18021c003, pte=0
[396634.080276] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x96ff3d40, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=1801fe003, pte=0
[396634.095038] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x1aaec5c0, fsynr=0x3, cb=19, sid=16(0x10 - GPU), pgd=faf57003, pud=faf57003, pmd=226727003, pte=0
[396634.109566] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x9e837bc0, fsynr=0x13, cb=19, sid=16(0x10 - GPU), pgd=1dc2c9003, pud=1dc2c9003, pmd=1802b8003, pte=0
[396634.124361] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x22360340, fsynr=0x3, cb=19, sid=16(0x10 - GPU), pgd=faf57003, pud=faf57003, pmd=2266eb003, pte=0
[396634.138900] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x25efa5c0, fsynr=0x3, cb=19, sid=16(0x10 - GPU), pgd=faf57003, pud=faf57003, pmd=2267ae003, pte=0
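To see at a glance where the faulting addresses fall and whether the faults are reads or writes, the fault lines can be tallied with a short script. This is a minimal Python sketch, not an NVIDIA tool. The script name `smmu_faults.py` is hypothetical, and reading bit 4 of `fsynr` as the write-not-read (WNR) flag is an assumption based on the ARM SMMUv2 FSYNR0 register layout.

```python
import re
import sys
from dataclasses import dataclass
from typing import List

# Matches the fields printed in the arm-smmu "Unhandled context fault" lines above.
FAULT_RE = re.compile(
    r"Unhandled context fault: iova=0x(?P<iova>[0-9a-fA-F]+), "
    r"fsynr=0x(?P<fsynr>[0-9a-fA-F]+), cb=(?P<cb>\d+), sid=(?P<sid>\d+)"
)

FOUR_GIB = 1 << 32  # first address that no longer fits in 32 bits


@dataclass
class SmmuFault:
    iova: int
    fsynr: int
    cb: int
    sid: int

    @property
    def is_write(self) -> bool:
        # Assumption: SMMUv2 FSYNR0 layout, where WNR (write-not-read) is bit 4.
        return bool(self.fsynr & 0x10)

    @property
    def below_4gib(self) -> bool:
        return self.iova < FOUR_GIB


def parse_faults(dmesg_text: str) -> List[SmmuFault]:
    """Extract every arm-smmu context fault from a chunk of dmesg output."""
    return [
        SmmuFault(
            iova=int(m.group("iova"), 16),
            fsynr=int(m.group("fsynr"), 16),
            cb=int(m.group("cb")),
            sid=int(m.group("sid")),
        )
        for m in FAULT_RE.finditer(dmesg_text)
    ]


def summarize(faults: List[SmmuFault]) -> str:
    """One-line tally of read/write faults and of IOVAs at or above 4 GiB."""
    writes = sum(f.is_write for f in faults)
    high = sum(not f.below_4gib for f in faults)
    return (f"{len(faults)} faults: {writes} writes, {len(faults) - writes} reads, "
            f"{high} at IOVA >= 4 GiB")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 smmu_faults.py dmesg.txt
    with open(sys.argv[1]) as fh:
        print(summarize(parse_faults(fh.read())))
```

Run as `python3 smmu_faults.py dmesg.txt`. For the ten lines quoted above it would count 6 faults with `fsynr=0x13` (writes, if the WNR assumption holds) and 4 with `fsynr=0x3` (reads), with every IOVA below 4 GiB; the highest is 0x9e837bc0.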
It’s hard to believe that NVIDIA didn’t test GPU memory allocations above 4 GB on the TX2, which is equipped with 8 GB. A dmesg snippet with the errors is attached.
Any suggestions on how it can be solved? Do we need to apply another kernel patch to address it?
-albertr
dmesg.txt (34.7 KB)