- Add defintions of the gfx/compute classes and methods that are
generated from the hw/sw class header files. Use these definitions
instead of the hard-coded ones so that mismatches may be caught by
the HAL checker.
- Abstract out the sw method handling functionality of
gr.intr.handle_sw_method into gr.intr.handle_gfx_sw_method and
gr.intr.handle_compute_sw_method and have gr.intr.handle_sw_method
call these two new HALs.
Jira NVGPU-9217
Change-Id: Ia30fcba6174878d9b5b7b5910c564c879a702ddc
Signed-off-by: Austin Tajiri <atajiri@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2885547
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
- On older chips, PMU uses CMD-MSG queue method to
communicate with NvGPU.
- From Turing onwards, PMU uses RPC method for this.
- During poweroff, we release pmu_sequence and reset the
members of the structure.
- For chips that use RPC, we need to free the payload as well
and then reset the members.
- Add pmu_seq_cleanup hal for this.
Bug 4019694
Bug 4059157
Change-Id: Ieb474fe4ed81f54d78480214cde53b51d45652c6
Signed-off-by: Divya <dsinghatwari@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2882267
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
On ga10b+ platforms, more VM space is needed to map various buffers
to bar2 vm. Engine method buffer is mapped for each pbdma and for
maximum supported TSGs this requires more than 32MB of space.
Also we need to consider fault buffer space and vab buffer
space requirement.
Bug 3958581
Change-Id: I9ee87119f762352ee12859b71c08a5f75b3554e0
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2872811
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently KMD_SCHEDULING_WORKER_THREAD can be enabled/disabled using
compile time flag but this flag does give ability to control the
feature based on the chip.
GSP is enabled only on ga10b where KMD_SCHEDULING_WORKER_THREAD should
be disabled while should be enabled for other chips at the same time
to support GVS tests.
Change adds enabled flag to control KMD_SCHEDULING_WORKER_THREAD based
on the chip.
Bug 3935433
Change-Id: I9d2f34cf172d22472bdc4614073d1fb88ea204d7
Signed-off-by: prsethi <prsethi@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2867023
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
gv11b onwards add hals
get_hwpm_gpcrouter_perfmon_regs_base
get_hwpm_fbprouter_perfmon_regs_base
And remove the ga10b version of same as that is redundant.
This is preparatory patch to update the gr_gv11b_pri_pmmgpcrouter_addr
and gr_gv11b_pri_pmmfbprouter_addr with the hals
JIRA NVGPU-9073
Change-Id: I8b04f9b61784ca2c09b248655435ea7a7ab92926
Signed-off-by: Ramalingam C <ramalingamc@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2828584
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>
GVS: Gerrit_Virtual_Submit <buildbot_gerritrpt@nvidia.com>
Subcontext PDBs and valid mask in the instance blocks of the channels
in various subcontexts has to be updated when new subcontext is
created or a subcontext is removed.
Replayable fault state is cached in the channel structure. Replayable
fault state for subcontext is set based on first channel's bind
parameter. It was earlier programmed in function channel_setup_ramfc.
init_inst_block_core is updated to setup TSG level pdb map and mask.
Added new hal gv11b_channel_bind to enable the subcontext on channel
bind.
Bug 3677982
Change-Id: I58156c5b3ab6309b6a4b8e72b0e798d6a39c1bee
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2719994
Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>
GVS: Gerrit_Virtual_Submit <buildbot_gerritrpt@nvidia.com>
The wait_pending HAL is now modified to simply
check the pending status of a given runlist.
The while loop is removed from this HAL.
A new function nvgpu_runlist_wait_pending_legacy() is
added that emulates the older wait_pending() HAL.
nvgpu_runlist_tick() is modified to accept a 64 bit
"preempt_grace_ns" value.
These changes prepare for upcoming control-fifo parser
changes.
Jira NVGPU-8619
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: If3f288eb6f2181743c53b657219b3b30d56d26bc
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2766100
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This change is reading the live pes from the register
"gr_gpc0_gpm_pd_live_physical_pes_r" and set it to
"gr_gpc0_swdx_pes_mask_r".
Every PES needs at least a TPC to work. If any of the TPCs
are floorswept,the live PES mask is read from
"gr_gpc0_gpm_pd_live_physical_pes_r" and the corresponding
active PES mask is updated in "gr_gpc0_swdx_pes_mask_r".
Bug 3677421
Change-Id: I899ac41c4a82beb3ce75c84ad57dcad262a49ba1
Signed-off-by: Dinesh T <dt@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2736560
(cherry picked from commit 85f2ceb3db6eeef925b49553f445d8cc31ec39da)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2759135
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
While creating a new channel, ioctls are called in the below sequence:
1. GPU_IOCTL_OPEN_CHANNEL
2. AS_IOCTL_BIND_CHANNEL
3. TSG_IOCTL_BIND_CHANNEL_EX
4. CHANNEL_ALLOC_GPFIFO_EX
5. CHANNEL_ALLOC_OBJ_CTX.
subctx pdbs and valid mask are programmed in the channel instance block
in the channel ioctls AS_IOCTL_BIND_CHANNEL & CHANNEL_ALLOC_GPFIFO_EX.
Programming them in the ioctl AS_IOCTL_BIND_CHANNEL is redundant.
Remove related hal g->ops.mm.init_inst_block_for_subctxs.
The hal init_inst_block will program context pdb and big page size.
The hal init_inst_block_core will program context pdb, big page size
and subctx 0 pdb. This is used by h/w units (fecs, pmu, hwpm, bar1,
bar2, sec2, gsp, perfbuf etc.).
For user channels, subctx pdbs are programmed as part of ramfc setup.
Bug 3677982
Change-Id: I6656b002d513404c1fd7c3d349933e80cca7e604
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2680907
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Patch defines a ZBC static table and configure it at sw layer. Later
existing API read this sw configuration and program it to hw.
This is applicable only for ga10b safety build and for other chips/
configuration it will be supported in the legacy way.
Bug 3585766
Change-Id: I00d79162c0b096616e3f555da965e82e47c014d1
Signed-off-by: prsethi <prsethi@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2713821
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
BVEC changes for nvgpu_rc_pbdma_fault and nvgpu_rc_mmu_fault
started reporting below MISRA issue.
kernel/nvgpu/drivers/gpu/nvgpu/common/fifo/tsg.c:321:
1. misra_c_2012_rule_5_1_violation: Declaration with identifier
"nvgpu_tsg_unbind_channel_check_hw_state", which is ambiguous.
kernel/nvgpu/drivers/gpu/nvgpu/common/fifo/tsg.c:349:
2. other_declaration: The first 31 characters of identifiers
"nvgpu_tsg_unbind_channel_check_ctx_reload" and
"nvgpu_tsg_unbind_channel_check_hw_state" are identical.
Do below renames to fix the issue. Doing both for consistency.
s/nvgpu_tsg_unbind_channel_check_hw_state/nvgpu_tsg_unbind_channel_hw_state_check
s/nvgpu_tsg_unbind_channel_check_ctx_reload/nvgpu_tsg_unbind_channel_ctx_reload_check
JIRA NVGPU-6772
Change-Id: Ib92cabe11c486621351bf15ddb86e20d16d514c4
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2584152
(cherry picked from commit a619f259c6a4ffccb05550767212989af60c2a90)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2706551
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
Volta+ chips supports PES floorsweeping and Ampere+(iGPU) chips supports
ROP floorsweeping. At present, the driver isn't aware of PES, ROP
floorsweeping, make the driver PES, ROP floorsweeping aware by introducing the
following fields in nvgpu_gr_config:
- gpc_(rop/pes)_mask: Contains the bit mask of non FSed ROP/PES units per GPC.
- gpc_(rop/pes)_logical_id_map: Translates per GPC ROP/PES physical id to
logical id.
Introduce the following HAL functions to read PES/ROP FS data:
- gops_fuse.fuse_status_opt_(pes/rop)_gpc: This fuction gets the FS
config from the fuse.
- gops_top.get_max_(pes/rop)_per_gpc: Gets the maximum number of PES/ROP
units that can be present in a GPC.
In addition, introduce the enabled flag NVGPU_SUPPORT_PES_FS to identify chips
which support PES floorsweeping, piggyback on NVGPU_SUPPORT_ROP_IN_GPC
enabled flag to identify ROP floorsweeping.
Bug 3524791
Change-Id: I065bab6c02618fe38892c8c890b069c340b85301
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2679570
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
a. LAUNCH_ERR
- Userspace error.
- Triggered due to faulty launch.
- Handle using recovery to reset CE engine and teardown the
faulty channel.
b. An INVALID_CONFIG -
- Triggered when LCE is mapped to floorswept PCE.
- On iGPU, we use the default PCE 2 LCE HW mapping.
The default mapping can be read from NV_CE_PCE2LCE_CONFIG
INIT value in CE refmanual.
- NvGPU driver configures the mapping on dGPUs (currently only on
Turing).
- So, this interrupt can only be triggered if there is
kernel or HW error
- Recovery ( which is killing the context + engine reset) will
not help resolve this error.
- Trigger Quiesce as part of handling.
c. A MTHD_BUFFER_FAULT -
- NvGPU driver allocates fault buffers for all TSGs or contexts,
maps them in BAR2 VA space and writes the VA into channel
instance block.
- Can be triggered only due to kernel bug
- Recovery will not help, need quiesce
d. FBUF_CRC_FAIL
- Triggered when the CRC entry read from the method fault buffer
does not match the computed CRC from the methods contained in
the buffer.
- This indicates memory corruption and is a fatal interrupt which
at least requires the LCE to be reset before operations can
start again, if not the entire GPU.
- Better to quiesce on memory corruption
CE Engine reset (via recovery) will not help.
e. FBUF_MAGIC_CHK_FAIL
- Triggered when the MAGIC_NUM entry read from the method fault
buf does not match NV_CE_MTHD_BUFFER_GLOBAL_HDR_MAGIC_NUM_VAL
- This indicates memory corruption and is a fatal interrupt
- Better to quiesce on memory corruption
f. STALLING_DEBUG
- Only triggered with SW write for debug purposes
- Debug interrupt, currently ignored
Move launch error handling from GP10b to GV11b HAL as -
1. LAUNCHERR_REPORT errcode METHOD_BUFFER_ACCESS_FAULT is not
defined on Pascal
2. We do not support GP10b on dev-main ToT
JIRA NVGPU-8102
Change-Id: Idc84119bc23b5e85f3479fe62cc8720e98b627a5
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678893
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Below CE interrupts do not have any users(usecases) on safety build;
disable them only on safety build.
1. BLOCKPIPE stall intr: Not used by GFX(VKSC) and CUDA on safety.
2. NONBLOCK_PIPE nonstall intr: Non-stall intrs are not supported
on safety build. Also, this one is not used by GFX(VKSC)
and CUDA.
3. STALLING_DEBUG intr: Added in Orin tree. It is only needed for
debugging. Disable on safety build as there is no current
usage in driver.
4. POISON_ERROR intr: Poison is a fault containment and not
supported on GA10b.
5. INVALID_CONFIG intr: Floor sweeping not supported on functional
safety SKU.
Bug 3548082
Change-Id: I8d97ccb38f138b2c04a780e1c255a64d28723405
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671927
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Replace ga10b_ptimer_isr with gk20a_ptimer_isr.
Remove GPU_PRI_ACCESS_VIOLATION reporting from gp10b hal as only
ga10b should be reporting these errors.
GPU_PRI_TIMEOUT_ERROR was only reported from ptimer ISR. However,
it is to be reported when error code is 0xbadf10xx that can be
seen through priv_ring ISR as well. Hence report this error
from ga10b_priv_ring_decode_error_code called from both bus
and priv_ring isr.
For other error cases GPU_PRI_ACCESS_VIOLATION is reported.
Other updates for priv_ring error handling are given below:
1. Add extra info decode functions for error codes:
- 0xbad001xx, 0xbad002xx, 0xbad0daxx - decode_host_pri_error
- 0xbadf13xx - decode_fecs_floorsweep_error
- 0xbadf24xx, 0xbadf25xx, 0xbadf26xx - decode_gcgpc_error &
decode_pri_local_decode_error
- 0xbadf20xx, 0xbadf22xx - decode_fecs_pri_orphan_error
- 0xbadf52xx - decode_pri_indirect_access_violation
- 0xbadf60xx - decode_pri_lock_sec_sensor_violation
2. Add more info prints to decode_pri_falcom_mem_violation.
3. Add entry for extra info corresponding to 0x41 to
pri_client_error_extra_4x.
4. Separate extra info decode function for error 0xbadf50xx.
JIRA NVGPU-7986
Change-Id: I519a66e8a7a158de23ced5a092a2ebfd62c305be
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671337
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
The timestamp control register in the SMCARB should be configured to have
the NV_PSMCARB_TIMESTAMP_CTRL_DISABLE_TICK field cleared, otherwise the PTIMER
ticks will not be sent to GR engine. Hence, remove the pre-processor checks
around grmgr.load_timestamp_prod call.
Bug 3510460
Bug 3500065
Change-Id: I223cea1aca28a9215287f540eb961a16e3fe6626
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671021
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add following HALs for Ga100 and Ga10b. These will
be used for calculating chiplet offsets corresponding
to GPC/FBP perf register.
get_pmmgpcrouter_per_chiplet_offset
get_pmmfbprouter_per_chiplet_offset
get_hwpm_fbp_perfmon_regs_base
get_hwpm_gpc_perfmon_regs_base
get_hwpm_fbprouter_perfmon_regs_base
get_hwpm_gpcrouter_perfmon_regs_base
Bug 200712091
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: Iec1a16ef4a3c26dca054c30d95bef991983dc2b7
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2648832
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
At present, for each resume cycle the driver sends the
"nvgpu_cbc_op_clear" command to L2 cache controller, this causes the
contents of the compression bit backing store to be cleared, and results
in corrupting the metadata for all the compressible surfaces already allocated.
Fix this by updating cbc.init function to be aware of resume state and
not clear the compression bit backing store, instead issue
"nvgpu_cbc_op_invalide" command, this should leave the backing store in a
consistent state across suspend/resume cycles.
The updated cbc.init HAL for gv11b is reusable acrosss multiple chips, hence
remove unnecessary chip specific cbc.init HALs.
Bug 3483688
Change-Id: I2de848a083436bc085ee98e438874214cb61261f
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2660075
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The flag pmu->pg->golden_image_initialized is set to
true during initial GPU context creation and is not
cleared while the GPU goes into pm_suspend (during railgate).
Hence, when the GPU resumes after un-railgate it retains
the previous value which can cause ELPG to kick in immediately.
Due to this, when ELPG and Railgating are enabled, IDLE_SNAP
is seen for read access of gr_gpc0_tpc0_sm_arch_r reg.
To resolve this, if golden image is ready set the
pmu->pg->golden_image_initialized to suspend state during railgate,
to delay the early enable of ELPG. Add a new
pmu_init_golden_img_state hal in the NVGPU_INIT_TABLE_ENTRY.
This will be called after all the GR access is done and GPU resumes
completely after un-railgate. This hal will then check if
golden_image_initialized flag is in suspend state, it will set it
to ready state and then re-enable ELPG.
Bug 3431798
Change-Id: I1fee83e66e09b6b78d385bbe60529d0724f79e79
Signed-off-by: Divya <dsinghatwari@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2639188
Reviewed-by: Mahantesh Kumbar <mkumbar@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit