In Bug 200588835, the spurious FBPA interrupts are seen on couple
of boards. These interrupts were found to be EDC (Error detection
and Correction) interrupts which are triggered due to ECC errors.
The EDC registers are not exposed to the driver, so the interrupt
status register cannot be cleared; resulting in interrupt storm.
Also, it was concluded that only bad HW can cause this failure
scenario. So, in the ISR for FBPA interrupts, get the GPU into
quiesce state as we don't expect the GPU to be in usable state post
such unrecoverable errors.
Adapt the quiesce code for Linux build too.
1. On Linux, we cannot exit the nvgpu process after quiesce like we
do on QNX. So, add nvgpu_disable_irqs() call to quiesce
implementation which is done as part of process exit handler on
QNX. Masking interrupts which is already done as part of quiesce
would be sufficient in most cases, but to be fail-safe
disable_irqs too.
3. Also, the IOCTL code looks at g->sw_ready, hence add
nvgpu_start_gpu_idle() to set g->sw_ready to false along with
setting NVGPU_DRIVER_IS_DYING = true.
We expect the nvgpu_sw_quiesce() call to finish before quiesce thread
wakes up from 50ms sleep. Hence, critical step like
nvgpu_start_gpu_idle() is added to nvgpu_sw_quiesce(), whereas the
somewhat redundant disable IRQs call is added to quiesce thread.
nvgpu_fifo_quiesce() was called twice by mistake; remove one of the
them.
Bug 2919899
Bug 200588835
Change-Id: I9beec688c2e1c0d8dfc1327ddf122684576f8684
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2354537
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Previously, unit interrupt enabling/disabling and corresponding MC level
interrupt enabling/disabling was not done at the same time.
With this change, stall and nonstall interrupt for units are programmed
at MC level along with individual unit interrupts. Kept access to MC
interrupt registers through mc.intr_lock spinlock.
For doing this separated CE and GR interrupt mask functions.
mc.intr_enable is only used when there is global interrupt
control to be set. Removed mc_gp10b.c as mc_gp10b_intr_enable
is now removed. Removed following functions - mc_gv100_intr_enable,
mc_gv11b_intr_enable & intr_tu104_enable. Removed intr_pmu_unit_config
as we can use the generic unit interrupt control function.
JIRA NVGPU-4336
Change-Id: Ibd296d4a60fda6ba930f18f518ee56ab3f9dacad
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2196178
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
To enable ecc interrupts early during nvgpu_finalize_poweron, ecc
support has to be enabled early. ecc support was being initialized
together for GR, LTC, PMU, FB units late in the poweron sequence.
Move the ecc init for each unit to respective unit's init functions.
And separate out the hal ecc functions from GR ecc unit to
respective hal units.
JIRA NVGPU-4336
Change-Id: I2c42fb6ba3192dece00be61411c64a56ce16740a
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2239153
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>