Background: In case of a deferred suspend implemented by gk20a_idle,
the device waits for a delay before suspending and invoking
power gating callbacks. This helps minimize resume latency for any
resume calls(gk20a_busy) that occur before the delay.
Now, some APIs spread across the driver requires that if the device
is powered on, then they can proceed with register writes, but if its
powered off, then it must return. Examples of such APIs include
l2_flush, fb_flush and even nvs_thread. We have relied on
some hacks to ensure the device is kept powered on to prevent any such
delayed suspension to proceed. However, this still raced for some calls
like ioctl l2_flush, so gk20a_busy() was added (Refer to commit Id
dd341e7ecbaf65843cb8059f9d57a8be58952f63)
Upstream linux kernel has introduced the API pm_runtime_get_if_active
specifically to handle the corner case for locking the state during the
event of a deferred suspend.
According to the Linux kernel docs, invoking the API with
ign_usage_count parameter set to true, prevents an incoming suspend
if it has not already suspended.
With this, there is no longer a need to check whether
nvgpu_is_powered_off(). Changed the behavior of gk20a_busy_noresume()
to return bool. It returns true, iff it managed to prevent
an imminent suspend, else returns false. For cases where
PM runtime is disabled, the code follows the existing implementation.
Added missing gk20a_busy_noresume() calls to tlb_invalidate.
Also, moved gk20a_pm_deinit to after nvgpu_quiesce() in
the module removal path. This is done to prevent regs access
after registers are locked out at the end of nvgpu_quiesce. This
can happen as some free function calls post quiesce might still
have l2_flush, fb_flush deep inside their stack, hence invoke
gk20a_pm_deinit to disable pm_runtime immediately after quiesce.
Kept the legacy implementation same for VGPU and
older kernels
Jira NVGPU-8487
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: I972f9afe577b670c44fc09e3177a5ce8a44ca338
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2715654
Reviewed-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
nvgpu_pmu_rpc_execute takes pmu rpc header address and dereferences
it at address past header based on rpc struct that the header is
part of.
This usage of pointer is not right and confuses CERT checker.
Instead, pass the rpc struct address as char pointer and use
as header or rpc struct as per need.
CID 17141
CID 154223
CID 17557
CID 154226
CID 153904
CID 153926
CID 153929
CID 153925
CID 153925
CID 225346
CID 225355
CID 225356
CID 225360
CID 225361
CID 225365
CID 225367
CID 296735
CID 330244
CID 17557
Bug 3512546
Change-Id: I93b154d4321e75c0d2b41f43d7c2b701682962a3
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2710224
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Mahantesh Kumbar <mkumbar@nvidia.com>
Reviewed-by: Sachin Nikam <snikam@nvidia.com>
GVS: Gerrit_Virtual_Submit
This is setting evict_max_ways for L2 cache to the maximum
supported value for safety.
In normal build L2 cache MAX_EVICT_LAST is configure via
KMD and RegOps. RegOps is enabled only on standard build
with CONFIG_DEBUGGER flag. This method we cant use it for
safety build. Safety we can make use of the patch buffer
to patch the register while creating the context.
JIRA NVGPU-8227
Change-Id: Iec5d73197239b9cad31c6b593ca2b87c224aad5e
Signed-off-by: Dinesh T <dt@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2708702
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
In current form, the default domain acts like any schedulable
domain. TSGs are bound to it and it can be enumerated via the
public interfaces.
The new expectation for the default domain is meant to change
from the current form to a pseudo domain that cannot act like
an ordinary domain in other ways, i.e. it must not be reachable
by in particular the domain management API, it can't be removed,
does not show up in lists, and TSGs cannot be explicitly bound to
this domain. It won't participate in round-robin domain scheduling.
It is not really a domain, and acts like one only when activated in
the manual mode.
Following changes are made overall to support the above change in
definition.
1) Domain creation and attaching the domain to the scheduler are now
split into two separate functions. The new default domain (having ID
= UINT64_MAX) is created separately from a static function without
linking it with other domains in the scheduler.
2) struct nvgpu_nvs_scheduler explicitely stores the default domain
to support direct lookups.
3) TSGs are initially not bound to default domain/rl_domain.
Jira NVGPU-8165
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: I916d11f4eea5124d8d64176dc77f3806c6139695
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2697477
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
In order to support the concept of the default domain, a new
rl domain is created that shadows all the other domains i.e.
all channels of all TSGs are replicated here. This is scheduled
by default during GPU boot.
1) The shadow rl_domain is constructed during poweron sequence via
nvgpu_runlist_alloc_shadow_rl_domain(). struct nvgpu_runlist
is appended to store this separately as 'shadow_rl_domain'.
This is scheduled in background as long as no other user created
rl domains exist.
2) 'shadow_rl_domain' is scheduled out once user created rl domain
exist. At this point, any updates in the user created rl domains
are synchronized with the 'shadow_rl_domain'. i.e. 'shadow_rl_domain'
is also reconstructed to contain active channels and tsgs from the rl
domain.
3) 'shadow_rl_domain' is scheduled back in when the last user created
rl domain is removed.
4) In future for manual mode, driver shall support explicitely
switching to 'shadow_rl_domain'. Also, we will move to an
implementation where 'shadow_rl_domain' is switched out only when
other domains are actively scheduled. These changes will be
implemented later.
Jira NVGPU-8165
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: Ia6a07d6bfe90e7f6c9e04a867f58c01b9243c3b0
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2704702
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
BVEC changes for nvgpu_rc_pbdma_fault and nvgpu_rc_mmu_fault
started reporting below MISRA issue.
kernel/nvgpu/drivers/gpu/nvgpu/common/fifo/tsg.c:321:
1. misra_c_2012_rule_5_1_violation: Declaration with identifier
"nvgpu_tsg_unbind_channel_check_hw_state", which is ambiguous.
kernel/nvgpu/drivers/gpu/nvgpu/common/fifo/tsg.c:349:
2. other_declaration: The first 31 characters of identifiers
"nvgpu_tsg_unbind_channel_check_ctx_reload" and
"nvgpu_tsg_unbind_channel_check_hw_state" are identical.
Do below renames to fix the issue. Doing both for consistency.
s/nvgpu_tsg_unbind_channel_check_hw_state/nvgpu_tsg_unbind_channel_hw_state_check
s/nvgpu_tsg_unbind_channel_check_ctx_reload/nvgpu_tsg_unbind_channel_ctx_reload_check
JIRA NVGPU-6772
Change-Id: Ib92cabe11c486621351bf15ddb86e20d16d514c4
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2584152
(cherry picked from commit a619f259c6a4ffccb05550767212989af60c2a90)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2706551
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
As per Safety_Services, a client must perform polling to ensure that the
previously reported errors are cleared at FSI, in case of back-to-back
error reporting. However, to minimize the polling overhead, NvGPU driver
performs polling only when the error to be reported is corrected error
to ensure that it is not overwriting the previously reported
uncorrected/corrected error. In case of uncorrected errors, it will be
reported without doing polling. This situation leads to a failure in
error reporting, when uncorrected errors are reported back-to-back. This
is acceptable for safety builds where SW quiesce will be triggered
immediately after the reporting of first uncorrected error. In case of
other build configurations, MCU/SEH takes the decision on encountering
uncorrected errors. To handle such build configurations, polling is
enabled for all types of errors, in all build configurations.
This patch also removes an unused macro "ERR_TYPE_MASK".
Bug 3622420
Change-Id: I750b0406faec9b229d8d0c74e986807234362cb9
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2707105
Reviewed-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
Fixed following Coverity Defects:
ioctl_as.c : Bad bit shift operation
mc_tu104.c : Bad bit shift operation
vm.c : Bad bit shift operation
vm_remap.c : Bad bit shift operation
A new linux header file for ilog2 is created.
The files which used the old ilog2 function
have been changed to use the new nvgpu_ilog2
function.
CID 9847922
CID 9869507
CID 9859508
CID 10112314
CID 10127813
CID 10127899
CID 10128004
Signed-off-by: Jinesh Parakh <jparakh@nvidia.com>
Change-Id: Ia201eea7cc426c3d6581e1e5ae3b882dbab3b490
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2700994
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Fix CERT issue in nvgpu_gr_falcon_bind_fecs_elpg where nvgpu_pmu_pg_buf
could return NULL. nvgpu_pmu_pg_buf is called from context where PG
will be enabled hence remove the NULL return logic as it is dead
code.
Replace nvgpu_pmu_pg_buf and nvgpu_pmu_pg_buf_get_cpu_va functions by
new function nvgpu_pmu_pg_buf_alloc.
CID 17860
Bug 3512546
Change-Id: I09820a966dadeb258167ce1433ca256f94845896
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2692466
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Update PES, ROP exception handling for NVGPU_ERRATA_3524791. Enable the
errata for all Volta+ chips.
ROP, PES exceptions are being reported using the physical-id,
where logical-id should have been used. All ESR status registers are
reported using logical-id, so this matches with the SW expectation.
To address the (1), update ROP, PES exception handler translate from
physical to logical-id before reading the status registers.
Bug 3524791
Change-Id: Ieacbfb306bb0e69cf0113dc92f18e401573722e3
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2680029
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Volta+ chips supports PES floorsweeping and Ampere+(iGPU) chips supports
ROP floorsweeping. At present, the driver isn't aware of PES, ROP
floorsweeping, make the driver PES, ROP floorsweeping aware by introducing the
following fields in nvgpu_gr_config:
- gpc_(rop/pes)_mask: Contains the bit mask of non FSed ROP/PES units per GPC.
- gpc_(rop/pes)_logical_id_map: Translates per GPC ROP/PES physical id to
logical id.
Introduce the following HAL functions to read PES/ROP FS data:
- gops_fuse.fuse_status_opt_(pes/rop)_gpc: This fuction gets the FS
config from the fuse.
- gops_top.get_max_(pes/rop)_per_gpc: Gets the maximum number of PES/ROP
units that can be present in a GPC.
In addition, introduce the enabled flag NVGPU_SUPPORT_PES_FS to identify chips
which support PES floorsweeping, piggyback on NVGPU_SUPPORT_ROP_IN_GPC
enabled flag to identify ROP floorsweeping.
Bug 3524791
Change-Id: I065bab6c02618fe38892c8c890b069c340b85301
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2679570
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add new profiler resource type NVGPU_PROFILER_PM_RESOURCE_TYPE_PC_SAMPLER.
Introduce regops HAL get_hwpm_pc_sampler_register_ranges to get
allowlist for PC_SAMPLER resources. Re-generate allowlist files to include
register ranges for PC_SAMPLER resources.
Update uapi header to advertise new resource type
NVGPU_PROFILER_PM_RESOURCE_ARG_PC_SAMPLER.
Bug 3408536
Change-Id: I7009ef822665771eed727da48ef1e89dcc6b9c4b
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2689057
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
GVS: Gerrit_Virtual_Submit
Replace the usage of tegra_fuse_readl with nvmem_cell_read_u32 for the
below fuse registers added as nvmem cells on v5.10+ kernels.
Older nvidia kernels do not have these tegra nvmem cell support.
1. FUSE_GCPLEX_CONFIG_FUSE_0
2. FUSE_RESERVED_CALIB0_0
3. FUSE_PDI0
4. FUSE_PDI1
bug 200633045
Change-Id: I187400720929233fcbc1970c9bbed34347b0a9a7
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2670828
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Jonathan Hunter <jonathanh@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
- When DISALLOW cmd is sent from driver to PMU the actual
completion of the disallow will be acknowledged by PMU
via a PG EVENT: ASYNC_CMD_RESP.
- Disallow needs a delayed ACK from PMU in order to disable
the ELPG.
- If ELPG is already engaged, the DISALLOW cmd will trigger
ELPG exit and then transition to PMU_PG_STATE_DISALLOW.
- After this whole process is completed, PMU will send
DISALLOW_ACK through ASYNC_CMD_RESP msg.
- After disallow command is sent from the driver, NvGPU driver
waits/polls for disallow command ack. This is sent immediately
by msg framework of PMU.
- Then, the driver will poll/wait for ASYNC_CMD_RESP event which
is the delayed DISALLOW ACK.
- The driver captures the ASYNC_CMD_RESP sent from PMU.
- set disallow_state to ELPG_OFF.
- If the driver does not wait/poll for this delayed disallow
ack from PMU, it can result in erros as PMU is still
processing DISALLOW cmd but the driver progressed further.
Bug 3580271
Change-Id: I332180c05b6a398107f065d54e9718b7038fb1b2
Signed-off-by: Divya <dsinghatwari@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2689500
Reviewed-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Update GR suspend routine to clear GR falcon "coldboot_bootstrap_done"
flag, this is needed because GPU power rails are turned off during
suspend cycle due to which GR falcons need to be bootstrapped again
during resume.
Function "nvgpu_gr_falcon_suspend" is added to clear the above mentioned
flag.
Bug 3497398
Bug 3514055
Change-Id: If852a2c09f05c096f287b845c56d8b4f335ec8e7
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2670554
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
The high level API for the timer unit is the same across all OSs, so
get rid of the slight code duplication by moving the timer init
functions under a new file in common code:
- nvgpu_timeout_init_cpu_timer
- nvgpu_timeout_init_cpu_timer_sw
- nvgpu_timeout_init_retry
Much of the timer logic is also duplicated, but it is mixed between OS
specific current time retrieval. With some refactoring and addition of
an OS independent time keeping layer, that logic could also be made
shared.
Change-Id: I75d02ceb0d32022b0ba7f3bcd9fdb13d47039dbc
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2669510
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Below listed HSI are handled with PMU ISR handler and
all these triggers interrupt from individual unit upon issue.
-Add ECC check for IMEM, DMEM, DCLS, REG, and MPU as per
HSI req
-Add MEMERR check for GPU_PMU_ACCESS_TIMEOUT_UNCORRECTED
PMU HSI id
-Add IOPMP check for GPU_PMU_ILLEGAL_ACCESS_UNCORRECTED
PMU HSI id
-Add WDT check for GPU_PMU_WDT_UNCORRECTED PMU HSI id
Bug 3491596
Bug 3366818
Change-Id: I751d653e447017ac62a2459da2c6bb9da506f438
Signed-off-by: mkumbar <mkumbar@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2686566
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
- Add PMU NVRISCV BR failure HSI support.
- Created a falcon unit function to check for the
BR competition status check and called from
other units as needed.
Bug 3491596
Bug 3366818
Change-Id: I5c3c6a7e6aeaad68f77e6b24f21239e40d9a7f78
Signed-off-by: mkumbar <mkumbar@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2686370
Reviewed-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
In Drive 6.0, the error reporting is supported only for orin (ga10b)
in dev-main. For this purpose, this patch does the following:
- Removes the redundant reporting of following IDs from gv11b:
- GPU_HOST_PFIFO_SCHED_ERROR
- GPU_HOST_PFIFO_CTXSW_TIMEOUT_ERROR
- GPU_HOST_PBDMA_HCE_ERROR
- GPU_MMU_L1TLB_SA_DATA_ECC_UNCORRECTED
- GPU_MMU_L1TLB_FA_DATA_ECC_UNCORRECTED
- GPU_LTC_CACHE_DSTG_ECC_CORRECTED
- GPU_LTC_CACHE_TSTG_ECC_UNCORRECTED
- Migrates the reporting of following IDs from gv11b to ga10b:
- GPU_SM_L1_TAG_ECC_CORRECTED
- GPU_SM_L1_TAG_ECC_UNCORRECTED
- GPU_SM_CBU_ECC_UNCORRECTED
- GPU_SM_LRF_ECC_UNCORRECTED
- GPU_SM_L1_DATA_ECC_UNCORRECTED
- GPU_SM_ICACHE_L1_DATA_ECC_UNCORRECTED
- GPU_SM_ICACHE_L0_PREDECODE_ECC_UNCORRECTED
- GPU_SM_L1_TAG_MISS_FIFO_ECC_UNCORRECTED
- GPU_SM_L1_TAG_S2R_PIXPRF_ECC_UNCORRECTED
- Removes the unused ID that doesn't have any HSI related to it:
- GPU_HOST_PBDMA_PREEMPT_ERROR
In addition to the above, this patch does the following:
- Updates error IDs related to page fault error.
- Updates look-up table to remove unused error IDs.
JIRA NVGPU-8094
Bug 200729736
Change-Id: Ifea76d38ba609c894560e61ff5a6e406290f919e
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2685249
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Dinesh T <dt@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
-Add check for ECC parity errors in IMEM, DMEM, EMEM, DCLS, REG
for ACR running in GSP engine.
The EXTIRQ3 external interrupt is set from ACR pointing towards host.
-Add function to check error type when ACR or Bootrom execution fails
and report accordingly to SDL with relevant error codes.
This is a part of HSI safety requirements.
Bug 3564039
Jira NVGPU-8108
Change-Id: I65407371f7a1d1ba50a10bdf443ef6b903eeaa36
Signed-off-by: mpoojary <mpoojary@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678100
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
a. LAUNCH_ERR
- Userspace error.
- Triggered due to faulty launch.
- Handle using recovery to reset CE engine and teardown the
faulty channel.
b. An INVALID_CONFIG -
- Triggered when LCE is mapped to floorswept PCE.
- On iGPU, we use the default PCE 2 LCE HW mapping.
The default mapping can be read from NV_CE_PCE2LCE_CONFIG
INIT value in CE refmanual.
- NvGPU driver configures the mapping on dGPUs (currently only on
Turing).
- So, this interrupt can only be triggered if there is
kernel or HW error
- Recovery ( which is killing the context + engine reset) will
not help resolve this error.
- Trigger Quiesce as part of handling.
c. A MTHD_BUFFER_FAULT -
- NvGPU driver allocates fault buffers for all TSGs or contexts,
maps them in BAR2 VA space and writes the VA into channel
instance block.
- Can be triggered only due to kernel bug
- Recovery will not help, need quiesce
d. FBUF_CRC_FAIL
- Triggered when the CRC entry read from the method fault buffer
does not match the computed CRC from the methods contained in
the buffer.
- This indicates memory corruption and is a fatal interrupt which
at least requires the LCE to be reset before operations can
start again, if not the entire GPU.
- Better to quiesce on memory corruption
CE Engine reset (via recovery) will not help.
e. FBUF_MAGIC_CHK_FAIL
- Triggered when the MAGIC_NUM entry read from the method fault
buf does not match NV_CE_MTHD_BUFFER_GLOBAL_HDR_MAGIC_NUM_VAL
- This indicates memory corruption and is a fatal interrupt
- Better to quiesce on memory corruption
f. STALLING_DEBUG
- Only triggered with SW write for debug purposes
- Debug interrupt, currently ignored
Move launch error handling from GP10b to GV11b HAL as -
1. LAUNCHERR_REPORT errcode METHOD_BUFFER_ACCESS_FAULT is not
defined on Pascal
2. We do not support GP10b on dev-main ToT
JIRA NVGPU-8102
Change-Id: Idc84119bc23b5e85f3479fe62cc8720e98b627a5
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678893
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add new HAL gops.gr.init.set_default_gfx_regs() to set graphics specific
PRI values for graphics contexts in function nvgpu_gr_obj_ctx_alloc().
Add new HAL gops.gr.init.capture_gfx_regs() to capture and save init
values for the PRIs. Add new struct nvgpu_gr_obj_ctx_gfx_regs to hold the
PRI init values.
Define HAL functions gv11b_gr_init_set_default_gfx_regs() and
gv11b_gr_init_capture_gfx_regs(). Set the HAL functions for
gv11b and ga10b.
Register accessors required to set PRIs are auto-generated.
Bug 3506078
Change-Id: I4c2843a274f3c924e402541e600e104ed0c9ed1c
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671598
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Shashank Singh <shashsingh@nvidia.com>
Reviewed-by: Jonathan Mccaffrey <jmccaffrey@nvidia.com>
GVS: Gerrit_Virtual_Submit
- MISRA Directive 4.7
Calling function "nvgpu_tsg_unbind_channel(tsg, ch, true)" which returns
error information without testing the error information.
- MISRA Rule 10.3
Implicit conversion from essential type "unsigned 64-bit int" to different
or narrower essential type "unsigned 32-bit int"
- MISRA Rule 5.7
A tag name shall be a unique identifier
JIRA NVGPU-5955
Change-Id: I109e0c01848c76a0947848e91cc6bb17d4cf7d24
Signed-off-by: srajum <srajum@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2572776
(cherry picked from commit 073daafe8a11e86806be966711271be51d99c18e)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678681
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch does the following:
- Defines a new flag "CONFIG_NVGPU_ENABLE_MISC_EC" when the
build is not targeted for RM Server. Since iGPU is in pass-through
mode in both safety and standard build, EPL libraries will not be
included in RM Server. This is done with the help of the flag
"NV_BUILD_CONFIGURATION_IS_VM_SERVER".
- Updates error id that will be reported to Safety_Services. The
format of the error ID is:
- HW_unit_id: (4-bits: bit 0 to 3),
- Error_id: (5-bits: bit 4 to 8),
- Corrected/Uncorrected error: (1-bit: bit-9),
- Remaining 22-bits are unused.
- Defines macros that will be used to form error ID.
- Defines a macro for SW_ERR_CODE_0 register which is allocated for
NvGPU to report errors to Safety_Services via MISC_EC.
JIRA NVGPU-8094
Change-Id: I02f37db75ef3b82952ef5f196f4e065d6c5d1a3e
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2677373
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch does the following:
- Adds error IDs for GSP ACR and GSP SCHED.
- Updates error IDs for PMU.
- Removes reporting of DMEM ECC_CORRECTED since DMEM RAMs in PWR is
protected only with parity mechanism, (ref: T23x_UPROC_Safety_IAS)
- Removes reporting of IMEM ECC_CORRECTED since IMEM RAMs for PROC in
PWR is protected only with parity mechanism, (ref: T23x_UPROC_Safety_IAS)
JIRA NVGPU-8094
Change-Id: I127e78b1aa76b552758d1fff5bc7a01b5f8f3e54
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2677589
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Most SM RAMs are protected with parity (except L1 D-cache TAG mem
which is protected with SEC-DED ECC). The memory corruption errors
reported by these RAMs are therefore uncorrected errors only.
Remove the code to handle corrected errors from GR SM ECC.
The SM RAMS ECC errors currently report error to SDL using ID
GPU_SM_L1_TAG_ECC_(UN)CORRECTED. Update the error reporting to
use the newly created error IDs for Drive 6.0.
JIRA NVGPU-7987
Change-Id: Ic426d45f851d87aafaa7963b937535582cdafadf
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2674389
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The current value of CTXSW_TIMEOUT (100ms) is too large and does not
meet the FTTI budget of 100ms. Update the value to 10 ms -
1. It seems well within FTTI - with some budget for recovery if
needed. The WCET for recovery is around 55ms.
2. It can be easily updated if needed later
Change-Id: If2ea3664c92d7426d1543d15614723e38b63aabd
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2672872
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Below CE interrupts do not have any users(usecases) on safety build;
disable them only on safety build.
1. BLOCKPIPE stall intr: Not used by GFX(VKSC) and CUDA on safety.
2. NONBLOCK_PIPE nonstall intr: Non-stall intrs are not supported
on safety build. Also, this one is not used by GFX(VKSC)
and CUDA.
3. STALLING_DEBUG intr: Added in Orin tree. It is only needed for
debugging. Disable on safety build as there is no current
usage in driver.
4. POISON_ERROR intr: Poison is a fault containment and not
supported on GA10b.
5. INVALID_CONFIG intr: Floor sweeping not supported on functional
safety SKU.
Bug 3548082
Change-Id: I8d97ccb38f138b2c04a780e1c255a64d28723405
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671927
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The PMA unit can only access GPU VAs within a 4GB window, hence both
the user allocated PMA buffer and the kernel allocated bytes available
buffer should lie in the same 4GB window. This is accomplished by
carving out and reserving a 4GB VA space in perbuf.vm and using fixed
GPU VAs to ensure that both buffers are bound within the same 4GB window.
In addition, update ALLOC_PMA_STREAM to use pma_buffer_offset,
pma_buffer_map_size fields correctly.
Bug 3503708
Change-Id: Ic5297a22c2db42b18ff5e676d565d3be3c1cd780
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671637
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit