During recovery, we set ch->unserviceable at the end after we preempt
the TSG and reset the engines. It might be too late and user-space
might submit more work to the broken channel which is not desirable.
Move setting this unserviceable flag right at the start
of recovery sequence.
Another thread doing a submit can still read the unserviceable flag
just before it is set here, leaving that submit stuck if recovery
completes before the submit thread advances enough to set up a post
fence visible for other threads. This could be fixed with a big lock
or with a double check at the end of the submit code after the job
data has been made visible.
We still release the fences, semaphore and error notifier wait queues
at the end; so user-space would not trigger channel unbind while
channel is being recovered.
Also, change the handle_mmu_fault APIs to return void as the
debug_dump return value is not used in any of the caller APIs.
JIRA NVGPU-5843
Change-Id: Ib42c2816dd1dca542e4f630805411cab75fad90e
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2385256
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch addresses misra violations due to SDL error reporting
callbacks. In particular, it addresses the following misra violation:
- misra_c_2012_directive_4_7_violation: Calling function
"nvgpu_report_*_err()" which returns error information without testing
the error information.
JIRA NVGPU-4025
Change-Id: Ia10b6b3fd9c127a8c5189c3b6ba316f243cedf04
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2196895
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch moves all the safe static and non-static functions as well
as its dependencies such as static declared structs into files with
_fusa.c extension. If the original file is left with no functions
remaining then the file is deleted.
Added changes in Makefile, Makefile.sources, nvgpu-hal-new.yaml for
compilation.
Jira NVGPU-3690
Change-Id: I81af67c308705faf8a681df63a6778e7de2076cf
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2146761
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Sagar Kamble <skamble@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch moves gops init related to SDL from qnx to common-core. For this
purpose, it does the following changes:
- Adds stub functions for linux and posix.
- Updates nvgpu_init.c for mapping err_ops with report error APIs.
- Updates nvgpu_err.h header file to include prototypes related to error
reporting APIs.
- Updates nvgpu-linux.yaml file to include sdl_stub file.
Jira NVGPU-3237
Change-Id: Idbdbe6f8437bf53504b29dc2d50214484ad18d6f
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2119681
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
pbdma fault recovery function reads pbdma status info to retrieve
channel id, tsg id and engine id. pbdma interrupts can only be cleared
after that information has been read otherwise because pbdma exits
from stall state, channel/tsg/engine could have changed and fault
recovery function reads information different from that when interrupt
is issued.
Bug 2123866
Change-Id: Ia0e0462ae02ec89a333c81bd933a74fbae8ae1e7
Signed-off-by: Peng Liu <pengliu@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2123774
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Moved
-mmu_fault_pending mm ops to is_mmu_fault_pending mc ops
-mmu_fault_pending fb ops to is_mmu_fault_pending fb.intr ops. This
is needed to check if mmu fault intr is pending for volta onwards.
Added
is_mmu_fault_pending fifo ops. This is needed to check if mmu fault
interrupt is pending for chips prior to volta
JIRA NVGPU-1313
Change-Id: Ie8e778387cd486cb19b18c4aee734c581dcd9229
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2094895
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
- Enable the reporting of PFIFO related errors such as engine syncpoint error,
memop timeout error, lb error to 3LSS framework.
- Remove the reporting of bind_error from gk20a since we already report it
from gv11b related fifo hal file.
Jira NVGPU-3087
Change-Id: Ic002be3a12a049010165870b861cdfb13a7f33d8
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2088579
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Moved below hals from {chip}/fifo_{chip}.[ch] to hal/fifo
get_mmu_fault_info
get_mmu_fault_desc
get_mmu_fault_client_desc
get_mmu_fault_gpc_desc
Moved gk20a_fifo_handle_dropped_mmu_fault to hal/fifo
JIRA NVGPU-1313
Change-Id: I949bcd482156c6e381006387372f13770277e8c5
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2083287
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
RC_TYPE_PBDMA_FAULT is the only recovery type for all the pbdma intr
functions. Thus, rc_type variable is changed to a boolean type
in all implementations of handle_pbdma_intr* functions.
"handled" variable is unused and removed from all the implementations of
handle_pbdma_intr* functions.
handle_pbdma_intr* HAL ops are renamed to handle_intr*.
Jira NVGPU-2950
Change-Id: I9605d930225a38ed76f25b6a94cb02d855f522dd
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2083748
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
fifo_pbdma_isr is moved to fifo_intr_gk20a HAL unit and renamed to
gk20a_fifo_pbdma_isr.
The pbdma specific handling part of the function
gk20a_fifo_handle_pbdma_intr is now separated into a top level HAL
function named handle_pbdma_intr. This HAL function is implemented
for GM20B and all the other architectures use the same implementation.
handle_pbdma_intr can accept NULL values for the parameters handled and
error_notifier.
gk20a_fifo_handle_pbdma_intr is called from
gv11b_fifo_poll_pbdma_chan_status and gk20a_fifo_pbdma_isr.
The call to gk20a_fifo_handle_pbdma_intr from
gv11b_fifo_poll_pbdma_chan_status doesn't progress to recovery.
Thus, the function gk20a_fifo_handle_pbdma_intr is removed to decouple
pbdma handling from recovery. gv11b_fifo_poll_pbdma_chan_status now
directly calls the HAL handle_pbdma_intr. For gk20a_fifo_pbdma_isr,
rc_type is used to proceed to recovery by calling
gk20a_fifo_pbdma_fault_rc.
gk20a_fifo_pbdma_fault_rc is changed to public from static.
Jira NVGPU-2950
Change-Id: I4f3597aca2317d4b745cd47bab9dd95c927160a9
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2073535
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Created gr falcon hal unit with moving following hal functions
from gr to gr falcon:
u32 (*fecs_base_addr)(void);
u32 (*gpccs_base_addr)(void);
void (*dump_stats)(struct gk20a *g);
u32 (*fecs_ctxsw_mailbox_size)(void);
u32 (*get_fecs_ctx_state_store_major_rev_id)(struct gk20a *g);
Modified chip hals to populate these new functions and related code
now refers to gr falcon hals.
Modified kernel headers to have following defs for
fecs/gpccs base address in gm20b/gp10b/gv11b/tu104:
static inline u32 gr_fecs_irqsset_r(void);
static inline u32 gr_gpcs_gpccs_irqsset_r(void);
Created base gm20b hals for fecs/gpccs_base_addr and
removed redundant gp106 related hals.
JIRA NVGPU-1881
Change-Id: I16e820cc1c89223f57988f1e5723fd8fdcbfe89d
Signed-off-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2081245
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Delete apply_ctxsw_timeout_intr ops and add
ctxsw_timeout_enable ops
Move chip specific sched_error and ctxsw_timeout
functions to hal/fifo/fifo_intr_* and hal/fifo/ctxsw_timeout_*
Add nvgpu_rc_ctxsw_timeout function under common/rc/rc.c
Do not check ctxsw timeout for channels that are no more
bound to tsg.
JIRA NVGPU-1312
Change-Id: Ide977fb60b3b72a27d9f22873f7a416c3bd1181d
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2075734
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>