gpu: nvgpu: Add CE interrupt handling

a. LAUNCH_ERR
    - Userspace error.
    - Triggered due to faulty launch.
    - Handle using recovery to reset the CE and tear down the
      faulty channel.

b. INVALID_CONFIG
    - Triggered when an LCE is mapped to a floorswept PCE.
    - On iGPU, we use the default PCE2LCE HW mapping.
      The default mapping can be read from the NV_CE_PCE2LCE_CONFIG
      INIT value in the CE refmanual.
    - The NvGPU driver configures the mapping on dGPUs (currently
      only on Turing).
    - So, this interrupt can only be triggered by a kernel or HW
      error.
    - Recovery (killing the context plus an engine reset) will not
      resolve this error.
    - Trigger quiesce as part of handling.

c. MTHD_BUFFER_FAULT
    - The NvGPU driver allocates fault buffers for all TSGs or
      contexts, maps them in the BAR2 VA space and writes the VA into
      the channel instance block.
    - Can be triggered only by a kernel bug.
    - Recovery will not help; quiesce is needed.

d. FBUF_CRC_FAIL
    - Triggered when the CRC entry read from the method fault buffer
      does not match the computed CRC from the methods contained in
      the buffer.
    - This indicates memory corruption and is a fatal interrupt which
      requires at least the LCE, if not the entire GPU, to be reset
      before operations can start again.
    - Better to quiesce on memory corruption; a CE reset (via
      recovery) will not help.

e. FBUF_MAGIC_CHK_FAIL
    - Triggered when the MAGIC_NUM entry read from the method fault
      buffer does not match
      NV_CE_MTHD_BUFFER_GLOBAL_HDR_MAGIC_NUM_VAL.
    - This indicates memory corruption and is a fatal interrupt.
    - Better to quiesce on memory corruption.

f. STALLING_DEBUG
    - Only triggered by a SW write, for debug purposes.
    - Debug interrupt, currently ignored.
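
The handling policy in (a) through (f) can be summarised as a small
decision function. The following is only an illustrative sketch, not
the driver code: the bit positions, the ce_action enum and the
ce_intr_action() helper are all hypothetical names made up for this
example, and the real interrupt bits come from the CE HW headers.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define CE_INTR_LAUNCH_ERR          (1U << 0)
#define CE_INTR_INVALID_CONFIG      (1U << 1)
#define CE_INTR_MTHD_BUFFER_FAULT   (1U << 2)
#define CE_INTR_FBUF_CRC_FAIL       (1U << 3)
#define CE_INTR_FBUF_MAGIC_CHK_FAIL (1U << 4)
#define CE_INTR_STALLING_DEBUG      (1U << 5)

/* Hypothetical action codes, ordered by severity. */
enum ce_action {
	CE_ACTION_NONE = 0,    /* nothing to do (e.g. STALLING_DEBUG) */
	CE_ACTION_RECOVER = 1, /* engine reset + faulty channel teardown */
	CE_ACTION_QUIESCE = 2  /* kernel/HW error or memory corruption */
};

/* Map a pending-interrupt mask to the most severe action required. */
static enum ce_action ce_intr_action(uint32_t pending)
{
	const uint32_t fatal = CE_INTR_INVALID_CONFIG |
			       CE_INTR_MTHD_BUFFER_FAULT |
			       CE_INTR_FBUF_CRC_FAIL |
			       CE_INTR_FBUF_MAGIC_CHK_FAIL;

	/* Recovery cannot fix these, so quiesce takes priority. */
	if ((pending & fatal) != 0U) {
		return CE_ACTION_QUIESCE;
	}

	/* Userspace launch error: recovery is sufficient. */
	if ((pending & CE_INTR_LAUNCH_ERR) != 0U) {
		return CE_ACTION_RECOVER;
	}

	/* STALLING_DEBUG (or nothing pending): currently ignored. */
	return CE_ACTION_NONE;
}
```

In this sketch a fatal interrupt wins over a simultaneous LAUNCH_ERR,
on the assumption that quiesce supersedes recovery when both are
pending.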

Move launch error handling from the GP10b HAL to the GV11b HAL
because:
1. The LAUNCHERR_REPORT errcode METHOD_BUFFER_ACCESS_FAULT is not
   defined on Pascal.
2. We do not support GP10b on dev-main ToT.

JIRA NVGPU-8102

Change-Id: Idc84119bc23b5e85f3479fe62cc8720e98b627a5
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678893
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tejal Kudav
2022-03-09 12:40:14 +00:00
committed by mobile promotions
parent 15739c52e9
commit b80b2bdab8
35 changed files with 246 additions and 144 deletions

@@ -98,8 +98,8 @@ gv100_dump_engine_status
gv100_read_engine_status_info
gv11b_ce_get_num_pce
gv11b_ce_init_prod_values
gv11b_ce_mthd_buffer_fault_in_bar2_fault
gv11b_ce_stall_isr
gv11b_ce_get_inst_ptr_from_lce
gv11b_channel_count
gv11b_channel_read_state
gv11b_channel_reset_faulted
@@ -275,6 +275,7 @@ nvgpu_bug_unregister_cb
nvgpu_can_busy
nvgpu_ce_engine_interrupt_mask
nvgpu_ce_init_support
nvgpu_ce_stall_isr
nvgpu_cg_blcg_fb_load_enable
nvgpu_cg_blcg_ltc_load_enable
nvgpu_cg_blcg_fifo_load_enable
@@ -792,6 +793,7 @@ nvgpu_rc_gr_fault
nvgpu_rc_sched_error_bad_tsg
nvgpu_rc_tsg_and_related_engines
nvgpu_rc_mmu_fault
nvgpu_rc_ce_fault
nvgpu_init_pramin
gk20a_bus_set_bar0_window
nvgpu_pramin_ops_init

@@ -98,8 +98,8 @@ gv100_dump_engine_status
gv100_read_engine_status_info
gv11b_ce_get_num_pce
gv11b_ce_init_prod_values
gv11b_ce_mthd_buffer_fault_in_bar2_fault
gv11b_ce_stall_isr
gv11b_ce_get_inst_ptr_from_lce
gv11b_channel_count
gv11b_channel_read_state
gv11b_channel_reset_faulted
@@ -283,6 +283,7 @@ nvgpu_bug_unregister_cb
nvgpu_can_busy
nvgpu_ce_engine_interrupt_mask
nvgpu_ce_init_support
nvgpu_ce_stall_isr
nvgpu_cg_blcg_fb_load_enable
nvgpu_cg_blcg_ltc_load_enable
nvgpu_cg_blcg_fifo_load_enable
@@ -811,6 +812,7 @@ nvgpu_rc_gr_fault
nvgpu_rc_sched_error_bad_tsg
nvgpu_rc_tsg_and_related_engines
nvgpu_rc_mmu_fault
nvgpu_rc_ce_fault
gp10b_priv_ring_isr_handle_0
gp10b_priv_ring_isr_handle_1
nvgpu_cic_mon_setup