nvgpu_device_get can return NULL if supplied invalid ID or instance
ID. We expect GR device struct to be non-NULL there hence just
assert that it is indeed non-NULL in gr_reset_engine and
ga10b_grmgr_init_gr_manager.
CID 224133
CID 250232
Bug 3512546
Change-Id: Id09a1c436a8e49b921111b940d3d013bd66bff7a
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2707018
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The emulate mode support is determined after chip detect and is flagged
by using NVGPU_SUPPORT_EMULATE_MODE flag. The present logic prevents
user from configuring the emulate mode sysfs knobs if this flag is not
set, however the emulate mode usecase requires the user to configure the
syfs knob prior to power-on, hence defer emulate mode check to a later
stage after chip detect.
Bug 3621460
Change-Id: If522527542fa8d7e95ccbcff43b74adbb9e976e6
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2703953
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Mayur Poojary <mpoojary@nvidia.com>
Reviewed-by: Mahantesh Kumbar <mkumbar@nvidia.com>
Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: David Li <davli@nvidia.com>
Fixed following Coverity Defects:
ioctl_as.c : Bad bit shift operation
mc_tu104.c : Bad bit shift operation
vm.c : Bad bit shift operation
vm_remap.c : Bad bit shift operation
A new linux header file for ilog2 is created.
The files which used the old ilog2 function
have been changed to use the new nvgpu_ilog2
function.
CID 9847922
CID 9869507
CID 9859508
CID 10112314
CID 10127813
CID 10127899
CID 10128004
Signed-off-by: Jinesh Parakh <jparakh@nvidia.com>
Change-Id: Ia201eea7cc426c3d6581e1e5ae3b882dbab3b490
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2700994
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Disable below interrupts on safety as they do not report any error
condition and are not used by CUDA and Graphics(VKSC) on safety
build.
Signoff from CUDA and VKSC is on Bug https://nvbugs/3588603
1. NV_PGRAPH_INTR_NOTIFY: This intr is set when the Notification
style is WRITE_THEN_AWAKEN.
2. NV_PGRAPH_INTR_SEMAPHORE: This is set when a 3d class sempahore is
released as the result ofa SetSemaphoreD method, when the
AwakenEnable field is TRUE.
3. NV_PGRAPH_INTR_BUFFER_NOTIFY: This bit is set when a Mem2mem DMA
completes and the LaunchDma method specifies the interrupt type
as INTERRUPT
4. NV_PGRAPH_INTR_DEBUG_METHODS: This is debug feature and not used
on QNX safety
Bug 3588603
JIRA NVGPU-8166
Change-Id: I6d07dfd2857ac047fac4599421600d364251df76
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2694363
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Update PES, ROP exception handling for NVGPU_ERRATA_3524791. Enable the
errata for all Volta+ chips.
ROP, PES exceptions are being reported using the physical-id,
where logical-id should have been used. All ESR status registers are
reported using logical-id, so this matches with the SW expectation.
To address the (1), update ROP, PES exception handler translate from
physical to logical-id before reading the status registers.
Bug 3524791
Change-Id: Ieacbfb306bb0e69cf0113dc92f18e401573722e3
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2680029
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Volta+ chips supports PES floorsweeping and Ampere+(iGPU) chips supports
ROP floorsweeping. At present, the driver isn't aware of PES, ROP
floorsweeping, make the driver PES, ROP floorsweeping aware by introducing the
following fields in nvgpu_gr_config:
- gpc_(rop/pes)_mask: Contains the bit mask of non FSed ROP/PES units per GPC.
- gpc_(rop/pes)_logical_id_map: Translates per GPC ROP/PES physical id to
logical id.
Introduce the following HAL functions to read PES/ROP FS data:
- gops_fuse.fuse_status_opt_(pes/rop)_gpc: This fuction gets the FS
config from the fuse.
- gops_top.get_max_(pes/rop)_per_gpc: Gets the maximum number of PES/ROP
units that can be present in a GPC.
In addition, introduce the enabled flag NVGPU_SUPPORT_PES_FS to identify chips
which support PES floorsweeping, piggyback on NVGPU_SUPPORT_ROP_IN_GPC
enabled flag to identify ROP floorsweeping.
Bug 3524791
Change-Id: I065bab6c02618fe38892c8c890b069c340b85301
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2679570
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add new profiler resource type NVGPU_PROFILER_PM_RESOURCE_TYPE_PC_SAMPLER.
Introduce regops HAL get_hwpm_pc_sampler_register_ranges to get
allowlist for PC_SAMPLER resources. Re-generate allowlist files to include
register ranges for PC_SAMPLER resources.
Update uapi header to advertise new resource type
NVGPU_PROFILER_PM_RESOURCE_ARG_PC_SAMPLER.
Bug 3408536
Change-Id: I7009ef822665771eed727da48ef1e89dcc6b9c4b
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2689057
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
GVS: Gerrit_Virtual_Submit
Below listed HSI are handled with PMU ISR handler and
all these triggers interrupt from individual unit upon issue.
-Add ECC check for IMEM, DMEM, DCLS, REG, and MPU as per
HSI req
-Add MEMERR check for GPU_PMU_ACCESS_TIMEOUT_UNCORRECTED
PMU HSI id
-Add IOPMP check for GPU_PMU_ILLEGAL_ACCESS_UNCORRECTED
PMU HSI id
-Add WDT check for GPU_PMU_WDT_UNCORRECTED PMU HSI id
Bug 3491596
Bug 3366818
Change-Id: I751d653e447017ac62a2459da2c6bb9da506f438
Signed-off-by: mkumbar <mkumbar@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2686566
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
In Drive 6.0, the error reporting is supported only for orin (ga10b)
in dev-main. For this purpose, this patch does the following:
- Removes the redundant reporting of following IDs from gv11b:
- GPU_HOST_PFIFO_SCHED_ERROR
- GPU_HOST_PFIFO_CTXSW_TIMEOUT_ERROR
- GPU_HOST_PBDMA_HCE_ERROR
- GPU_MMU_L1TLB_SA_DATA_ECC_UNCORRECTED
- GPU_MMU_L1TLB_FA_DATA_ECC_UNCORRECTED
- GPU_LTC_CACHE_DSTG_ECC_CORRECTED
- GPU_LTC_CACHE_TSTG_ECC_UNCORRECTED
- Migrates the reporting of following IDs from gv11b to ga10b:
- GPU_SM_L1_TAG_ECC_CORRECTED
- GPU_SM_L1_TAG_ECC_UNCORRECTED
- GPU_SM_CBU_ECC_UNCORRECTED
- GPU_SM_LRF_ECC_UNCORRECTED
- GPU_SM_L1_DATA_ECC_UNCORRECTED
- GPU_SM_ICACHE_L1_DATA_ECC_UNCORRECTED
- GPU_SM_ICACHE_L0_PREDECODE_ECC_UNCORRECTED
- GPU_SM_L1_TAG_MISS_FIFO_ECC_UNCORRECTED
- GPU_SM_L1_TAG_S2R_PIXPRF_ECC_UNCORRECTED
- Removes the unused ID that doesn't have any HSI related to it:
- GPU_HOST_PBDMA_PREEMPT_ERROR
In addition to the above, this patch does the following:
- Updates error IDs related to page fault error.
- Updates look-up table to remove unused error IDs.
JIRA NVGPU-8094
Bug 200729736
Change-Id: Ifea76d38ba609c894560e61ff5a6e406290f919e
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2685249
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Dinesh T <dt@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
-Add check for ECC parity errors in IMEM, DMEM, EMEM, DCLS, REG
for ACR running in GSP engine.
The EXTIRQ3 external interrupt is set from ACR pointing towards host.
-Add function to check error type when ACR or Bootrom execution fails
and report accordingly to SDL with relevant error codes.
This is a part of HSI safety requirements.
Bug 3564039
Jira NVGPU-8108
Change-Id: I65407371f7a1d1ba50a10bdf443ef6b903eeaa36
Signed-off-by: mpoojary <mpoojary@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678100
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
a. LAUNCH_ERR
- Userspace error.
- Triggered due to faulty launch.
- Handle using recovery to reset CE engine and teardown the
faulty channel.
b. An INVALID_CONFIG -
- Triggered when LCE is mapped to floorswept PCE.
- On iGPU, we use the default PCE 2 LCE HW mapping.
The default mapping can be read from NV_CE_PCE2LCE_CONFIG
INIT value in CE refmanual.
- NvGPU driver configures the mapping on dGPUs (currently only on
Turing).
- So, this interrupt can only be triggered if there is
kernel or HW error
- Recovery ( which is killing the context + engine reset) will
not help resolve this error.
- Trigger Quiesce as part of handling.
c. A MTHD_BUFFER_FAULT -
- NvGPU driver allocates fault buffers for all TSGs or contexts,
maps them in BAR2 VA space and writes the VA into channel
instance block.
- Can be triggered only due to kernel bug
- Recovery will not help, need quiesce
d. FBUF_CRC_FAIL
- Triggered when the CRC entry read from the method fault buffer
does not match the computed CRC from the methods contained in
the buffer.
- This indicates memory corruption and is a fatal interrupt which
at least requires the LCE to be reset before operations can
start again, if not the entire GPU.
- Better to quiesce on memory corruption
CE Engine reset (via recovery) will not help.
e. FBUF_MAGIC_CHK_FAIL
- Triggered when the MAGIC_NUM entry read from the method fault
buf does not match NV_CE_MTHD_BUFFER_GLOBAL_HDR_MAGIC_NUM_VAL
- This indicates memory corruption and is a fatal interrupt
- Better to quiesce on memory corruption
f. STALLING_DEBUG
- Only triggered with SW write for debug purposes
- Debug interrupt, currently ignored
Move launch error handling from GP10b to GV11b HAL as -
1. LAUNCHERR_REPORT errcode METHOD_BUFFER_ACCESS_FAULT is not
defined on Pascal
2. We do not support GP10b on dev-main ToT
JIRA NVGPU-8102
Change-Id: Idc84119bc23b5e85f3479fe62cc8720e98b627a5
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678893
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
g->fifo.runlists[] has size of g->fifo.max_runlists. During quiesce,
U32_MAX bitmask is passed to g->ops.runlist.write_state() HAL to
disable all the runlist. The Ga10b HAL implementation of
g->ops.runlist.write_state() references into runlists[] structure
for all the bits set in input runlist mask. For mask=U32_MAX,
there is NULL pointer dereference when runlist_id exceeds
g->fifo.max_runlists.
Add runlist_id boundary check before dereferencing the runlists[]
structure.
Update Gk20a HAL too with similar guard to make sure incorrect mask
doesn't get written to the register.
JIRA NVGPU-8102
Change-Id: Ic613aa38361b8b23d953c76d6924aba6bf6d5ea9
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2680847
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
GVS: Gerrit_Virtual_Submit
Add new HAL gops.gr.init.set_default_gfx_regs() to set graphics specific
PRI values for graphics contexts in function nvgpu_gr_obj_ctx_alloc().
Add new HAL gops.gr.init.capture_gfx_regs() to capture and save init
values for the PRIs. Add new struct nvgpu_gr_obj_ctx_gfx_regs to hold the
PRI init values.
Define HAL functions gv11b_gr_init_set_default_gfx_regs() and
gv11b_gr_init_capture_gfx_regs(). Set the HAL functions for
gv11b and ga10b.
Register accessors required to set PRIs are auto-generated.
Bug 3506078
Change-Id: I4c2843a274f3c924e402541e600e104ed0c9ed1c
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671598
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Shashank Singh <shashsingh@nvidia.com>
Reviewed-by: Jonathan Mccaffrey <jmccaffrey@nvidia.com>
GVS: Gerrit_Virtual_Submit
This patch does the following:
- Adds error IDs for GSP ACR and GSP SCHED.
- Updates error IDs for PMU.
- Removes reporting of DMEM ECC_CORRECTED since DMEM RAMs in PWR is
protected only with parity mechanism, (ref: T23x_UPROC_Safety_IAS)
- Removes reporting of IMEM ECC_CORRECTED since IMEM RAMs for PROC in
PWR is protected only with parity mechanism, (ref: T23x_UPROC_Safety_IAS)
JIRA NVGPU-8094
Change-Id: I127e78b1aa76b552758d1fff5bc7a01b5f8f3e54
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2677589
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Most SM RAMs are protected with parity (except L1 D-cache TAG mem
which is protected with SEC-DED ECC). The memory corruption errors
reported by these RAMs are therefore uncorrected errors only.
Remove the code to handle corrected errors from GR SM ECC.
The SM RAMS ECC errors currently report error to SDL using ID
GPU_SM_L1_TAG_ECC_(UN)CORRECTED. Update the error reporting to
use the newly created error IDs for Drive 6.0.
JIRA NVGPU-7987
Change-Id: Ic426d45f851d87aafaa7963b937535582cdafadf
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2674389
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Below CE interrupts do not have any users(usecases) on safety build;
disable them only on safety build.
1. BLOCKPIPE stall intr: Not used by GFX(VKSC) and CUDA on safety.
2. NONBLOCK_PIPE nonstall intr: Non-stall intrs are not supported
on safety build. Also, this one is not used by GFX(VKSC)
and CUDA.
3. STALLING_DEBUG intr: Added in Orin tree. It is only needed for
debugging. Disable on safety build as there is no current
usage in driver.
4. POISON_ERROR intr: Poison is a fault containment and not
supported on GA10b.
5. INVALID_CONFIG intr: Floor sweeping not supported on functional
safety SKU.
Bug 3548082
Change-Id: I8d97ccb38f138b2c04a780e1c255a64d28723405
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671927
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The PMA unit can only access GPU VAs within a 4GB window, hence both
the user allocated PMA buffer and the kernel allocated bytes available
buffer should lie in the same 4GB window. This is accomplished by
carving out and reserving a 4GB VA space in perbuf.vm and using fixed
GPU VAs to ensure that both buffers are bound within the same 4GB window.
In addition, update ALLOC_PMA_STREAM to use pma_buffer_offset,
pma_buffer_map_size fields correctly.
Bug 3503708
Change-Id: Ic5297a22c2db42b18ff5e676d565d3be3c1cd780
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671637
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
Falcon safety debug flag was previously disabled for safety debug
profile. This patch enables the flag support for safety debug.
copy_from_dmem function is required to copy the debug info from
dmem debug buffer whenever there's an error generated.
Hence, moved copy_from_dmem function to fusa file from non-fusa
and added ifdef condition to only enable when non-fusa or falcon debug
flag is set.
Also, some fixes for type conversion error in falcon_debug.c during
compilation.
Bug 3482988
Change-Id: Ic0ea32b3227b84d4ba0835e6e1aeb40f58ec7327
Signed-off-by: mpoojary <mpoojary@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2673900
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Allowlist and register ranges are declared global. Sparse throws warning
for them as:
- allowlist_ga100.c:351:5: warning: symbol
'ga100_cau_register_offset_allowlist' was not declared.
Should it be static?
- allowlist_ga100.c:389:47: warning: symbol
'ga100_hwpm_pma_trigger_register_ranges' was not declared.
Should it be static?
Make these arrays static global const.
Bug 3528472
Change-Id: I319f36c1579c630632b994295677c5831c1bff6b
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2676591
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Sachin Nikam <snikam@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Replace ga10b_ptimer_isr with gk20a_ptimer_isr.
Remove GPU_PRI_ACCESS_VIOLATION reporting from gp10b hal as only
ga10b should be reporting these errors.
GPU_PRI_TIMEOUT_ERROR was only reported from ptimer ISR. However,
it is to be reported when error code is 0xbadf10xx that can be
seen through priv_ring ISR as well. Hence report this error
from ga10b_priv_ring_decode_error_code called from both bus
and priv_ring isr.
For other error cases GPU_PRI_ACCESS_VIOLATION is reported.
Other updates for priv_ring error handling are given below:
1. Add extra info decode functions for error codes:
- 0xbad001xx, 0xbad002xx, 0xbad0daxx - decode_host_pri_error
- 0xbadf13xx - decode_fecs_floorsweep_error
- 0xbadf24xx, 0xbadf25xx, 0xbadf26xx - decode_gcgpc_error &
decode_pri_local_decode_error
- 0xbadf20xx, 0xbadf22xx - decode_fecs_pri_orphan_error
- 0xbadf52xx - decode_pri_indirect_access_violation
- 0xbadf60xx - decode_pri_lock_sec_sensor_violation
2. Add more info prints to decode_pri_falcom_mem_violation.
3. Add entry for extra info corresponding to 0x41 to
pri_client_error_extra_4x.
4. Separate extra info decode function for error 0xbadf50xx.
JIRA NVGPU-7986
Change-Id: I519a66e8a7a158de23ced5a092a2ebfd62c305be
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671337
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit