Rework regops execution API to accomodate below updates for new
profiler design
- gops.regops.exec_regops() should accept TSG pointer instead of
channel pointer.
- Remove individual boolean parameters and add one flag field.
Below new flags are added to this API :
NVGPU_REG_OP_FLAG_MODE_ALL_OR_NONE
NVGPU_REG_OP_FLAG_MODE_CONTINUE_ON_ERROR
NVGPU_REG_OP_FLAG_ALL_PASSED
NVGPU_REG_OP_FLAG_DIRECT_OPS
Update other APIs, e.g. gr_gk20a_exec_ctx_ops() and validate_reg_ops()
as per new API changes.
Add new API gk20a_is_tsg_ctx_resident() to check context residency
from TSG pointer.
Convert gr_gk20a_ctx_patch_smpc() to a HAL gops.gr.ctx_patch_smpc().
Set this HAL only for gm20b since it is not required for later chips.
Also, remove subcontext code from this function since gm20b does not
support subcontext.
Remove stale comment about missing vGPU support in exec_regops_gk20a()
Bug 2510974
Jira NVGPU-5360
Change-Id: I3c25c34277b5ca88484da1e20d459118f15da102
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2389733
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
- NvRmGpuDeviceSetSmDebugMode uses regops interface.
- NvRmGpuDeviceTriggerSuspend, NvRmGpuDeviceWaitForPause,
and NvRmGpuDeviceResumeFromPause should return error on Pascal+. Use
regops interface to suspend/resume.
- On non-cilp devices(Maxwell), NvRmGpuDeviceTriggerSuspend,
NvRmGpuDeviceWaitForPause, NvRmGpuDeviceResumeFromPause and
NvRmGpuDeviceSetSmDebugMode are used when debugger(including coredump,
memcheck) is attached or when CUDA application uses a syscall that
requires traphandler(assert, cnp).
Bug 2558022
Bug 2559631
Bug 2706068
JIRA NVGPU-5502
Change-Id: I9eb2ab0c8c75c50f53523d8bf39c75f98b34f3f0
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2376159
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The driver configures the sm hww global, warp ESR report masks during poweron
as part of gops_gr.gr_init_support. However, during golden context init, these
are overwritten with default entries from sw_ctx_load list; this leaves the
report masks in a state inconsistent with the driver expectation.
The driver should configure the sm hww warp, global ESR report masks during
golden context init and not before it; Hence, move set_hww_esr_report_mask from
power-on path to golden context init.
In addition, update set_hww_esr_report_mask to do RMW, so as to retain the
values loaded from sw_ctx_load list.
Update global ESR report mask to enable all exceptions.
Bug 3029888
Bug 2997718
Change-Id: Id7ad4cff5409982143f49695c95c5e1d1c9fdec9
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2367466
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Add tu104 specific HAL tu104_gr_falcon_ctrl_ctxsw() that processes below
CTXSW methods to start/stop SMPC global mode :
NVGPU_GR_FALCON_METHOD_START_SMPC_GLOBAL_MODE
NVGPU_GR_FALCON_METHOD_STOP_SMPC_GLOBAL_MODE
Add new tu104 specific HAL tu104_gr_update_smpc_global_mode() to trigger
SMPC global mode start/stop using gops.gr.falcon.ctrl_ctxsw().
Update nvgpu_dbg_gpu_ioctl_smpc_ctxsw_mode() to enable/disable SMPC
global mode if channel is not bound to debug session.
Bug 2510974
Bug 2257799
Jira NVGPU-5360
Change-Id: I1f9d8f2a2d30a4738f291db3fc72c400d24f4048
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2368696
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Below APIs to update hwpm/smpc ctxsw mode take a channel pointer as a
parameter. APIs then extract corresponding TSG from channel and perform
various operations on context stored in TSG.
g->ops.gr.update_smpc_ctxsw_mode()
g->ops.gr.update_hwpm_ctxsw_mode()
Update both above APIs to accept TSG pointer instead of a channel.
This is a refactor work to support new profiler design where a profiler
object is bound to TSG and keeps track of TSG only.
Bug 2510974
Jira NVGPU-5360
Change-Id: Ia4cefda503d8420f2bd32d07c57534924f0f557a
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2366122
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
quad type reg_ops were only needed on Kepler, and not for any other chip
beginning Maxweel.
HAL g->ops.gr.access_smpc_reg() was incorrectly set for Volta and Turing
whereas it was only applicable to Kepler. Delete it.
There is no register in the quad type whitelist since the type itself is
not supported anymore. Remove the empty whitelists for all chips and
also delete below HALs:
g->ops.regops.get_qctl_whitelist()
g->ops.regops.get_qctl_whitelist_count()
hal/regops/regops_gv100.* files are not used anymore. Delete the files
instead of just deleting quad HALs in these files.
Bug 200628391
Change-Id: I4dcc04bef5c24eb4d63d913f492a8c00543163a2
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2366035
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch removes the reporting of _ECC_CORRECTED errors which are
not applicable to GV11B. Specifically, this patch removes the code
related to the reporting of the following service IDs:
NVGUARD_SERVICE_IGPU_SM_SWERR_LRF_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_CBU_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_PMU_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_GPCCS_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_FECS_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_GCC_SWERR_L15_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_MMU_SWERR_L1TLB_FA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_MMU_SWERR_L1TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_L2TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_PTE_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_PDE0_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L0_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L0_PREDECODE_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L1_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L1_PREDECODE_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_TAG_MISS_FIFO_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_TAG_S2R_PIXPRF_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_TSTG_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_RSTG_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_DSTG_BE_ECC_CORRECTED
Bug 200616002
Change-Id: I199c396f9f6a6be007bd6d3c556199b5a73c3c91
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2349587
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently, nvgpu_writel_loop() writes to a register and immediately
checks if register value is updated. It might take some time for
hardware registers to get updated with value written by software.
Modify nvgpu_writel_loop() to accept number of retries to check if
register value is updated and assert with nvgpu_assert().
Also, move nvgpu_writel_loop() to common code and use generic
nvgpu_readl() and nvgpu_writel() APIs.
JIRA NVGPU-5490
Change-Id: Iaaf24203a91eee3d05de7d0c7dea18113367de5f
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2348628
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The gk20a_debug_dump() function implicitly adds a newline since it
uses nvgpu_err() under the hood (for uart destined prints). For the
seq_file destined writes it does not so there is an annoying inconsistency.
Remove the newline that many of the gk20a_debug_dump() calls add and add
the newline to the (now) seq_printf() call. This reduces the length of
debug dump logs and speeds them up - UART is _very_ slow after all.
Also cleanup some formatting issues in the various debug prints I
happened to notice.
JIRA NVGPU-5541
Change-Id: Iabf853d5c50214794fc4cbb602dfffabeb877132
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2347956
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Enable build flags for dGPU in safety, when
NVGPU_FORCE_DGPU_SAFETY_PROFILE is set.
Use libnvgpu-dgpu_safe.exports for dGPU safety build.
Add build flags for tu104 HAL initialization (to solve
undefined symbols in safety build).
Temporarily add non-fusa files needed to build dGPU in safety.
related functions will have to move to fusa files.
Jira NVGPU-4611
Change-Id: I41db0c039c7f15d9191cdb811b4906e779d5cc88
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2310276
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add new hal function
gp10b_gr_intr_handle_class_error.
Update handle_class_error hal function for
gp10b, gv11b and tu104 to
gp10b_gr_intr_handle_class_error from
gm20b_gr_intr_handle_class_error.
gr_trapped_data_mme_pc uses 12 bits from gp10b.
Move gm20b_gr_intr_handle_class_error hal function
to non-fusa section.
Jira NVGPU-4913
Signed-off-by: vinodg <vinodg@nvidia.com>
Change-Id: Ic93013ba43d4bf409527109f2f2d43db11c4238e
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2314249
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Current clk unit has multiple header files under pmuif folder.
This has combination of public struct which is accessed outside the
unit and private struct which is accessed within clk unit.
This patch segregates them based on their accessibility.
All private items are moved into ucode_clk_inf.h from pmuif which only
clk can access.
All public items are moved into include/clk.h which other units can
access
This will help in documentation of items for public items.
NVGPU-4491
Change-Id: Iccb0571e05ecb3cb13363390bed8c7214409b543
Signed-off-by: Abdul Salam <absalam@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2292318
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
From gv11b onwards, FECS ucode returns an ACK for set watchdog
timeout method. Failure to wait for this ACK was leading to races,
and in some cases, the ACK could be mistaken for the reply to the
next method.
In particular, this happened for the discover golden image size
method which is sent after set watchdog timeout.
With instrumented FECS ucode, it takes longer for the code to
process the set watchdog timeout method, and the write to ack
that method could happen after nvgpu driver clears the mailbox to
send the discover image size method.
With an invalid golden context image size, FECS ended up causing
an MMU fault while attempting to save past allocated buffer.
Added NVGPU_GR_FALCON_METHOD_SET_WATCHDOG_TIMEOUT to be used with
gops_gr_falcon.ctrl_ctxsw, and implemented 2 variants:
- gm20b_gr_falcon_ctrl_ctxsw, without ACK
- gv11b_gr_falcon_ctrl_ctxsw, with ACK
Added NVGPU_GR_FALCON_SUBMIT_METHOD_F_LOCKED flag to allow
executing above method without re-acquiring FECS lock. Longer term,
the 'flags' could be added to gop_gr_falcon.ctrl_ctxsw parameters.
Use gops_gr_falcon.ctrl_ctxsw instead of register writes to invoke
set watchdog timeout method in gm20b_gr_falcon_wait_ctxsw_ready.
Also replaced calls to gm20b_gr_falcon_ctrl_ctxsw to
gops_gr.falcon.ctrl_ctxsw when appropriate, since there are
multiple variants (gm20b, gp10b and gv11b).
Last, fixed clearing of mailbox 0 in gm20b_gr_falcon_bind_instblk.
Bug 200586923
Change-Id: I653b9a216555eec8cd4bb01d6f202bc77b75a939
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2287340
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
For integration testing in safety build, we need to test ctxsw firmware
error codes. inorder to test this, we need some instrumentation in
nvgpu driver.
Added below instrumentation for CTXSW FW error code integration testing.
1. Reduce the ctxsw watchdog time, so ctxsw watchdog always triggered.
Use CONFIG_NVGPU_CTXSW_FW_ERROR_WDT_TESTING for testing watchdog
2. Submit unsupported method to CTXSW FW, it will trigger err.
Use CONFIG_NVGPU_CTXSW_FW_ERROR_CODE_TESTING for testing err code.
JIRA NVGPU-4471
Change-Id: Ib45a946b3e38d3b6dd5bbee277c5d3e7c55521c0
Signed-off-by: sagar <skadamati@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2284048
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Below functions in common.gr hal subunits include unnecessary
asserts to ensure value is not truncated when parsing into U32 size.
gm20b_gr_init_commit_global_attrib_cb()
gp10b_gr_init_commit_global_bundle_cb()
gp10b_gr_init_commit_global_pagepool()
gv11b_gr_init_commit_global_attrib_cb()
Make use of nvgpu_safe_cast_u64_to_u32() and remove unnecessary
asserts
gp10b_gr_init_commit_global_bundle_cb() function checks if size <=
U32_MAX value. But since size is declared as u32, it will always be
<= U32_MAX value so there is no point in the check.
Remove unnecessary check.
Jira NVGPU-4778
Change-Id: I9562afd1b31c3c6b095f607cbdf725d33d87effb
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2279898
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vinod Gopalakrishnakurup <vinodg@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
To achieve permanent fault coverage, the CTAs launched by
each kernel in the mission and redundant contexts must execute on
different hardware resources. This feature proposes modifications
in the software to modify the virtual SM id to TPC mapping across
the mission and redundant contexts. The virtual SM identifier to TPC
mapping is done by nvgpu when setting up the patch context.
The recommendation for the redundant setting is to offset the
assignment by one TPC, and not by one GPC. This will ensure that both
GPC and TPC diversity. The SM and Quadrant diversity will happen
naturally. For kernels with few CTAs, the diversity is guaranteed
to be 100%. In case of completely random CTA allocation,
e.g. large number of CTAs in the waiting queue, the diversity is
1 - 1/#SM, or 87.5% for GV11B, 97.9% for TU104.
Added NvGpu CFLAGS to enable/disable the SM diversity support
"CONFIG_NVGPU_SM_DIVERSITY".
This support is only enabled on gv11b and tu104 QNX non safety build.
JIRA NVGPU-4685
Change-Id: I8e3eaa72d8cf7aff97f61e4c2abd10b2afe0fe8b
Signed-off-by: Lakshmanan M <lm@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2268026
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-by: Shashank Singh <shashsingh@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>