gpu: nvgpu: Trigger quiesce on spurious FBPA intr

In Bug 200588835, spurious FBPA interrupts were seen on a couple
of boards. These interrupts were found to be EDC (Error Detection
and Correction) interrupts, which are triggered by ECC errors.
The EDC registers are not exposed to the driver, so the interrupt
status register cannot be cleared, resulting in an interrupt storm.
It was also concluded that only bad HW can cause this failure
scenario. So, in the ISR for FBPA interrupts, put the GPU into the
quiesce state, as we don't expect the GPU to be usable after
such unrecoverable errors.
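
The ISR behaviour described above can be sketched as a small standalone
model. This is not the nvgpu code: struct fbpa_gpu_model,
model_request_quiesce() and model_fbpa_isr() are hypothetical names, and
the guard that turns repeated quiesce requests into no-ops is an
assumption made so that an interrupt storm triggers quiesce only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct gk20a; not the real nvgpu type. */
struct fbpa_gpu_model {
	bool quiesce_pending;          /* set once quiesce has been requested */
	unsigned int quiesce_requests; /* how many times quiesce actually ran */
};

/* Hypothetical quiesce entry point: repeated calls are no-ops (assumption). */
static void model_request_quiesce(struct fbpa_gpu_model *g)
{
	assert(g != NULL);
	if (g->quiesce_pending) {
		return;
	}
	g->quiesce_pending = true;
	g->quiesce_requests++;
}

/*
 * Hypothetical FBPA ISR: the status cannot be cleared by the driver,
 * so any pending bit is treated as unrecoverable and leads to quiesce.
 */
static void model_fbpa_isr(struct fbpa_gpu_model *g, uint32_t intr_status)
{
	assert(g != NULL);
	if (intr_status == 0U) {
		return; /* nothing pending */
	}
	model_request_quiesce(g);
}
```

In this model, calling model_fbpa_isr() many times with a non-zero status
leaves quiesce_requests at 1, which is the property wanted when the
interrupt keeps firing.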

Adapt the quiesce code for the Linux build too.
1. On Linux, we cannot exit the nvgpu process after quiesce like we
   do on QNX. So, add an nvgpu_disable_irqs() call to the quiesce
   implementation; on QNX this is done as part of the process exit
   handler. Masking interrupts, which is already done as part of
   quiesce, would be sufficient in most cases, but to be fail-safe,
   disable the IRQs too.
2. Also, the IOCTL code looks at g->sw_ready, hence add
   nvgpu_start_gpu_idle() to set g->sw_ready to false along with
   setting NVGPU_DRIVER_IS_DYING = true.
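
The effect of clearing g->sw_ready on the IOCTL path can be illustrated
with a minimal model. The names gpu_state, model_start_gpu_idle() and
model_ioctl_entry() are hypothetical, and returning -ENODEV is an
assumption; the error code the real IOCTL code uses is not stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the driver state consulted by the IOCTL path. */
struct gpu_state {
	bool sw_ready;        /* models g->sw_ready */
	bool driver_is_dying; /* models the NVGPU_DRIVER_IS_DYING flag */
};

/* Models what the message says nvgpu_start_gpu_idle() does to the state. */
static void model_start_gpu_idle(struct gpu_state *g)
{
	assert(g != NULL);
	g->driver_is_dying = true;
	g->sw_ready = false;
}

/* Hypothetical IOCTL gate: refuse new work once sw_ready is cleared. */
static int model_ioctl_entry(const struct gpu_state *g)
{
	assert(g != NULL);
	if (!g->sw_ready) {
		return -ENODEV; /* assumed error code, for illustration only */
	}
	return 0;
}
```

Before model_start_gpu_idle() the gate returns 0; afterwards every call
is rejected, so user space cannot submit new work to a quiesced GPU.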

We expect the nvgpu_sw_quiesce() call to finish before the quiesce
thread wakes up from its 50ms sleep. Hence, a critical step like
nvgpu_start_gpu_idle() is added to nvgpu_sw_quiesce(), whereas the
somewhat redundant disable IRQs call is added to the quiesce thread.
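
The intended split between the synchronous call and the deferred thread
can be sketched as an ordered trace. This is a model only: the trace
helpers are hypothetical, the step names just echo the functions
mentioned in this message, and steps not discussed above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_MAX 8

/* Records the order in which the modelled quiesce steps run. */
struct quiesce_trace {
	const char *steps[TRACE_MAX];
	size_t count;
};

static void trace_step(struct quiesce_trace *t, const char *name)
{
	assert(t != NULL && t->count < TRACE_MAX);
	t->steps[t->count++] = name;
}

/* Synchronous part: critical steps run in the caller's context. */
static void trace_sw_quiesce(struct quiesce_trace *t)
{
	trace_step(t, "mask_interrupts");
	trace_step(t, "start_gpu_idle");
	trace_step(t, "fifo_sw_quiesce");
}

/* Deferred part: runs only after the quiesce thread's sleep expires. */
static void trace_sw_quiesce_thread(struct quiesce_trace *t)
{
	trace_step(t, "sleep_timeout");
	trace_step(t, "disable_irqs");
}
```

Running trace_sw_quiesce() followed by trace_sw_quiesce_thread() puts
"start_gpu_idle" ahead of "disable_irqs" in the trace, which mirrors the
expectation that the critical step completes before the thread wakes up.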

nvgpu_fifo_quiesce() was called twice by mistake; remove one of
them.

Bug 2919899
Bug 200588835

Change-Id: I9beec688c2e1c0d8dfc1327ddf122684576f8684
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2354537
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tejal Kudav
2020-06-02 12:22:39 +00:00
committed by Alex Waterman
parent 6778fc9eb6
commit 4dcfbc19de
2 changed files with 9 additions and 3 deletions

@@ -87,9 +87,10 @@ static int nvgpu_sw_quiesce_thread(void *data)
}
nvgpu_err(g, "SW quiesce thread running");
nvgpu_msleep(NVGPU_SW_QUIESCE_TIMEOUT_MS);
nvgpu_fifo_sw_quiesce(g);
nvgpu_disable_irqs(g);
nvgpu_channel_sw_quiesce(g);
nvgpu_bug_exit(1);
@@ -198,6 +199,7 @@ void nvgpu_sw_quiesce(struct gk20a *g)
nvgpu_cond_signal_interruptible(&g->sw_quiesce_cond);
gk20a_mask_interrupts(g);
nvgpu_start_gpu_idle(g);
nvgpu_fifo_sw_quiesce(g);
}