linux-nv-oot/drivers/pci/controller
Nagarjuna Kristam e94216eeb9 pci: tegra-edma: add check for irq sync
Issue:
When an AER error is received during PCIe driver (C2C) probe, EDMA
driver de-init and AER handling run at the same time. PCIe driver
probe runs under the device lock, and so does AER handling. The
synchronize_irq() call in the EDMA driver then deadlocks against the
AER handler, which is waiting for the same device lock.

Trace at RP hot-plug:
[  605.149061] INFO: task irq/344-tegra_p:394 blocked for more than 120 seconds.
<3>[  605.149066]       Not tainted 5.10.120-rt70-tegra #1
<3>[  605.149068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  605.149070] task:irq/344-tegra_p state:D
<6>[  605.149071] sched_debug_info: last_run_cpu 0 wake_cpu 0
<6>[  605.149073]  on_cpu 0 on_rq 0 migrate_dis 0
<4>[  605.149074]  stack:    0 pid:  394 ppid:     2 flags:0x00000028
<6>[  605.149077] Call trace:
<6>[  605.149078]  __switch_to+0xc8/0x120
<6>[  605.149090]  __schedule+0x334/0x930
<6>[  605.149095]  schedule+0x64/0x120
<6>[  605.149097]  synchronize_irq+0x8c/0xc0
<6>[  605.149101]  edma_stop+0x1ac/0x320
<6>[  605.149105]  tegra_pcie_edma_deinit+0x60/0x170
<6>[  605.149106]  nvscic2c_pcie_epc_probe+0x3ec/0x4a0 [nvscic2c_pcie_epc]
<6>[  605.149120]  pci_device_probe+0xe8/0x1a0
<6>[  605.149123]  really_probe+0xf8/0x3d0
<6>[  605.149126]  driver_probe_device+0x60/0xc0
<6>[  605.149128]  __device_attach_driver+0x8c/0xd0
<6>[  605.149130]  bus_for_each_drv+0x8c/0xe0
<6>[  605.149133]  __device_attach+0xf8/0x160
<6>[  605.149135]  device_attach+0x28/0x40
<6>[  605.149137]  pci_bus_add_device+0x5c/0xc0
<6>[  605.149141]  pci_bus_add_devices+0x40/0x90
<6>[  605.149142]  pci_bus_add_devices+0x6c/0x90
<6>[  605.149143]  pci_host_probe+0x50/0xd0
<6>[  605.149145]  dw_pcie_host_init+0x1c0/0x420
<6>[  605.149148]  tegra_pcie_config_rp+0x78/0x250
<6>[  605.149151]  tegra_pcie_prsnt_irq+0xb0/0x120

Trace at AER ISR at the same time:
<6>[  605.149335] Call trace:
<6>[  605.149336]  __switch_to+0xc8/0x120
<6>[  605.149339]  __schedule+0x334/0x930
<6>[  605.149342]  schedule+0x64/0x120
<6>[  605.149344]  __rt_mutex_slowlock+0xc4/0x150
<6>[  605.149346]  rt_mutex_slowlock_locked+0xb0/0x230
<6>[  605.149348]  rt_mutex_slowlock+0x88/0xf0
<6>[  605.149350]  __rt_mutex_lock_state+0x64/0xa0
<6>[  605.149352]  _mutex_lock_blk_flush+0x58/0x80
<6>[  605.149356]  _mutex_lock+0x28/0x40
<6>[  605.149358]  report_error_detected+0x34/0x120
<6>[  605.149361]  report_frozen_detected+0x30/0x40
<6>[  605.149363]  pci_walk_bus+0x68/0xc0
<6>[  605.149366]  pcie_do_recovery+0x154/0x1d0
<6>[  605.149368]  aer_process_err_devices+0xec/0x110
<6>[  605.149372]  aer_isr+0x154/0x1d0
<6>[  605.149374]  irq_thread_fn+0x34/0xa0
<6>[  605.149376]  irq_thread+0x188/0x280
<6>[  605.149379]  kthread+0x16c/0x1a0
<6>[  605.149381]  ret_from_fork+0x10/0x18

Fix:
Perform synchronize_irq() in EDMA de-init only if descriptors are
still pending processing. This acts as a workaround (WAR) to avoid
the deadlock during probe and during any other shared interrupt
handling within device lock scope.
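
The check described above can be sketched as follows. This is a minimal
user-space illustration, not the driver's actual code: the struct, its
w_idx/r_idx fields, edma_chan_pending(), edma_stop_sketch() and
synchronize_irq_stub() are all hypothetical names, and the stub only
counts calls where the kernel's synchronize_irq(unsigned int irq) would
block until in-flight handlers finish.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one EDMA channel's descriptor ring state. */
struct edma_chan_sketch {
	unsigned int w_idx;	/* next slot the driver will fill */
	unsigned int r_idx;	/* next slot awaiting completion */
};

/* Counts how often the stub below was invoked (for illustration only). */
static int sync_calls;

/* Stub standing in for the kernel's synchronize_irq(unsigned int irq). */
static void synchronize_irq_stub(unsigned int irq)
{
	(void)irq;
	sync_calls++;
}

/* A channel has pending work when its ring indices differ (assumption). */
static bool edma_chan_pending(const struct edma_chan_sketch *ch)
{
	return ch->w_idx != ch->r_idx;
}

/*
 * Sketch of the de-init path: wait for the IRQ handler only when at
 * least one channel still has descriptors outstanding. With nothing
 * pending, the potentially blocking wait is skipped entirely.
 * Returns true if the wait was performed.
 */
static bool edma_stop_sketch(const struct edma_chan_sketch *chans,
			     size_t nchans, unsigned int irq)
{
	size_t i;

	for (i = 0; i < nchans; i++) {
		if (edma_chan_pending(&chans[i])) {
			synchronize_irq_stub(irq);
			return true;
		}
	}
	return false;
}
```

In the probe-failure case from the traces, no transfers have been queued
yet, so a check of this kind would skip the wait and the probe path would
release the device lock without blocking.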

Bug 4414241

Change-Id: Ie389ebc3b32d6a1121c154ab60d08aa6c3c53e36
Signed-off-by: Nagarjuna Kristam <nkristam@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3042451
(cherry picked from commit 2453fa2e09eafd23570f25091c5c1f9d92ed2aa4)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3047403
Reviewed-by: Bitan Biswas <bbiswas@nvidia.com>
GVS: Gerrit_Virtual_Submit <buildbot_gerritrpt@nvidia.com>
2024-01-09 18:14:40 -08:00