Most SM RAMs are protected with parity (except L1 D-cache TAG mem
which is protected with SEC-DED ECC). The memory corruption errors
reported by these RAMs are therefore uncorrected errors only.
Remove the code to handle corrected errors from GR SM ECC.
The SM RAMS ECC errors currently report error to SDL using ID
GPU_SM_L1_TAG_ECC_(UN)CORRECTED. Update the error reporting to
use the newly created error IDs for Drive 6.0.
JIRA NVGPU-7987
Change-Id: Ic426d45f851d87aafaa7963b937535582cdafadf
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2674389
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The current value of CTXSW_TIMEOUT (100ms) is too large and does not
meet the FTTI budget of 100ms. Update the value to 10 ms -
1. It seems well within FTTI - with some budget for recovery if
needed. The WCET for recovery is around 55ms.
2. It can be easily updated if needed later
Change-Id: If2ea3664c92d7426d1543d15614723e38b63aabd
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2672872
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Below CE interrupts do not have any users(usecases) on safety build;
disable them only on safety build.
1. BLOCKPIPE stall intr: Not used by GFX(VKSC) and CUDA on safety.
2. NONBLOCK_PIPE nonstall intr: Non-stall intrs are not supported
on safety build. Also, this one is not used by GFX(VKSC)
and CUDA.
3. STALLING_DEBUG intr: Added in Orin tree. It is only needed for
debugging. Disable on safety build as there is no current
usage in driver.
4. POISON_ERROR intr: Poison is a fault containment and not
supported on GA10b.
5. INVALID_CONFIG intr: Floor sweeping not supported on functional
safety SKU.
Bug 3548082
Change-Id: I8d97ccb38f138b2c04a780e1c255a64d28723405
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671927
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The PMA unit can only access GPU VAs within a 4GB window, hence both
the user allocated PMA buffer and the kernel allocated bytes available
buffer should lie in the same 4GB window. This is accomplished by
carving out and reserving a 4GB VA space in perbuf.vm and using fixed
GPU VAs to ensure that both buffers are bound within the same 4GB window.
In addition, update ALLOC_PMA_STREAM to use pma_buffer_offset,
pma_buffer_map_size fields correctly.
Bug 3503708
Change-Id: Ic5297a22c2db42b18ff5e676d565d3be3c1cd780
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671637
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
Falcon safety debug flag was previously disabled for safety debug
profile. This patch enables the flag support for safety debug.
copy_from_dmem function is required to copy the debug info from
dmem debug buffer whenever there's an error generated.
Hence, moved copy_from_dmem function to fusa file from non-fusa
and added ifdef condition to only enable when non-fusa or falcon debug
flag is set.
Also, some fixes for type conversion error in falcon_debug.c during
compilation.
Bug 3482988
Change-Id: Ic0ea32b3227b84d4ba0835e6e1aeb40f58ec7327
Signed-off-by: mpoojary <mpoojary@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2673900
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Allowlist and register ranges are declared global. Sparse throws warning
for them as:
- allowlist_ga100.c:351:5: warning: symbol
'ga100_cau_register_offset_allowlist' was not declared.
Should it be static?
- allowlist_ga100.c:389:47: warning: symbol
'ga100_hwpm_pma_trigger_register_ranges' was not declared.
Should it be static?
Make these arrays static global const.
Bug 3528472
Change-Id: I319f36c1579c630632b994295677c5831c1bff6b
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2676591
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Sachin Nikam <snikam@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Move ACR WPR init region cmd from ISR to LSFM as part of LSF bootstrap
request to execute the ACR commands sequentially as well as a blocking
call by polling is_wpr_init_done status till set to true. Needed to
add dealy after each ACR command for ga10b LSPMU due to nvriscv priv
lockdown for ACR commands asynchronously from the nvgpu as detailed
below,
LSPMU engages priv lockdown whenever ACR commands needs to be
processed, and nvgpu polls for interrupt status by polling
pwr_falcon_irqstat_r registers once command is sent to PMU to
process the ACK message from LSPMU if priv lockdown is not
engaged. During NVRISCV priv lockdown couple of register are
not accessible including irqstat register, priv lockdown is
done by LSPMU upon ACR command receive and its asynchronous
to nvgpu which cause nvgpu irqstat read data to be 0xbadf*
during polling at corner cases even though priv lockdown
check is present and interpreting wrongly the irq stat
register.
Add delay of 5ms after ACR command sent to LSPMU(LSPMU takes
~3.5msec to complete the command process) and before polling
the irqstat register in nvgpu to engage priv lockdown in LSPMU.
This additional delay will help to skip reading the irqstat at
corner case during the priv lockdown process.
Bug 3464141
Bug 3482947
Change-Id: I494493a92f6ede5dcb876aeb0d76d54969f0f59e
Signed-off-by: mkumbar <mkumbar@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2673246
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Replace ga10b_ptimer_isr with gk20a_ptimer_isr.
Remove GPU_PRI_ACCESS_VIOLATION reporting from gp10b hal as only
ga10b should be reporting these errors.
GPU_PRI_TIMEOUT_ERROR was only reported from ptimer ISR. However,
it is to be reported when error code is 0xbadf10xx that can be
seen through priv_ring ISR as well. Hence report this error
from ga10b_priv_ring_decode_error_code called from both bus
and priv_ring isr.
For other error cases GPU_PRI_ACCESS_VIOLATION is reported.
Other updates for priv_ring error handling are given below:
1. Add extra info decode functions for error codes:
- 0xbad001xx, 0xbad002xx, 0xbad0daxx - decode_host_pri_error
- 0xbadf13xx - decode_fecs_floorsweep_error
- 0xbadf24xx, 0xbadf25xx, 0xbadf26xx - decode_gcgpc_error &
decode_pri_local_decode_error
- 0xbadf20xx, 0xbadf22xx - decode_fecs_pri_orphan_error
- 0xbadf52xx - decode_pri_indirect_access_violation
- 0xbadf60xx - decode_pri_lock_sec_sensor_violation
2. Add more info prints to decode_pri_falcom_mem_violation.
3. Add entry for extra info corresponding to 0x41 to
pri_client_error_extra_4x.
4. Separate extra info decode function for error 0xbadf50xx.
JIRA NVGPU-7986
Change-Id: I519a66e8a7a158de23ced5a092a2ebfd62c305be
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671337
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Include the scheduling domain name in the channel debug dump. The domain
name of a channel is the domain name of its parent TSG, if any. Copy
just the name into the dump info to avoid refcounting concerns.
While at it, reword the deterministic flag for less ambiguity.
Jira NVGPU-6791
Change-Id: I06041277f938e20f23de9aa419cfffbaa028035e
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2673101
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Replace id-based lookup with fd-based lookup when binding a TSG to a
domain. The device node based domain interface naturally provides access
control; this way userspace tools can limit which uid/gid can access
each domain.
Also, explicitly disallow binding channels to a TSG that has no runlist
domain yet. Normally a TSG is in the default domain if nothing else has
been specified, but the default domain can be deleted.
Jira NVGPU-6788
Change-Id: I2af96dfc002367d894eaf0c175006332f790c27f
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2651165
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Create device nodes for user-created scheduling domains. This helps
leverage filesystem based access control: domains can be chosen to be
available for a limited set of users on a system.
The device nodes are dynamic: they can be removed while the driver is
running normally. This is a bit different from the nodes that exist
until the driver is unloaded, so the devno/domain mapping is stored in a
separate list. The usual container_of pattern would suffer from an
unavoidable race condition if a domain file was opened while the same
domain would get removed.
As usual, domain refcounting prevents a domain from being removed. Now
the open device files hold refs and thus any open domain files prevent a
domain from getting removed, in addition to the userspace-invisible ref
that is taken when a TSG is bound to a domain.
While at it, make the query ioctl guarded by the sched domain mutex, as
domains might technically get added or removed during the querying code.
Jira NVGPU-6788
Change-Id: Ief2a09a442c4e70f1f2be8a32359341071d74659
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2651164
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Add a function to find the nvgpu_class of the v2 user device nodes. This
is the last entry in the class list, as the devices are created in that
order.
The v2 user class is not defined when MIG is enabled because there are
multiple logical devices; bigger changes would be needed for this.
Jira NVGPU-6788
Change-Id: I2177c1e5b4d0bbec77a4e258391859242b4f20d6
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2674427
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Allow gk20a_create_device() to happen outside the main ioctl logic and
rename it to have the modern nvgpu_ prefix. Add a separate function to
do cdev allocation and refactor the existing two callers slightly to
avoid repetition on the cdev struct initialization.
As a side effect, this modification fixes the error path that used to
not return an error if adding a device fails and also leaked the
allocated cdev memory.
Jira NVGPU-6788
Change-Id: Ia1f018b88d78fafdfcf4e95f6aa66e2368e58974
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2674426
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
The existing Linux character device nodes are statically configured
once. For other dynamically created devices, track the next minor number
in nvgpu_os_linux as a rudimentary allocator.
Only a small number of increments are expected at this time; in the
future, a bitmap might be more appropriate for tracking out-of-order
deallocations too.
Jira NVGPU-6788
Change-Id: I016ee8471313086620f9ab371583d6763848b0e2
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2651163
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
When device_create fails, take PTR_ERR from the subdev that was
returned. Commit e8bac374c0 ("gpu: nvgpu: Use device instead of
platform_device") refactored this code but forgot to rename the error
retrieval.
Change-Id: Id01adac431da77a71c8e71e1b01a065826f5ebcf
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2673712
Reviewed-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
- Setting different tpc_pg_mask value leads to GPU crash.
- It is observed that with GPU railgating disabled, if
tpc_pg_mask is set, "the gpu is powered on" error is
reported and it won't allow to set the tpc_pg_mask, which
is expected.
- With GPU railgating enabled, the different tpc_pg_mask
value is set and the GPU is crashed.
- So, add check for golden image initialized before setting the
TPC, GPC and FBP PG mask.
- This check won't allow to update TPC, GPC and FBP mask after
golden image initialization and thus no GPU crash happens.
Bug 3544499
Change-Id: Ia003beaaec9dead22da74ea5862a81986780966b
Signed-off-by: Divya <dsinghatwari@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2672202
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Ninad Malwade <nmalwade@nvidia.com>
Reviewed-by: Seema Khowala <seemaj@nvidia.com>
Tested-by: Ninad Malwade <nmalwade@nvidia.com>
GVS: Gerrit_Virtual_Submit
The timestamp control register in the SMCARB should be configured to have
the NV_PSMCARB_TIMESTAMP_CTRL_DISABLE_TICK field cleared, otherwise the PTIMER
ticks will not be sent to GR engine. Hence, remove the pre-processor checks
around grmgr.load_timestamp_prod call.
Bug 3510460
Bug 3500065
Change-Id: I223cea1aca28a9215287f540eb961a16e3fe6626
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671021
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The ctxsw ucode saves all the ctxsw'ed TPC priv registers in the TPC
priv segment of the ctxsw image. In ga10b, these registers can be stored
in either of the two arrangements:
- INTERLEAVED: means the format is sorted by address first, then by TPC number
- MIGRATION: exact opposite of interleaved.
Update HAL functions gr_ga10b_process_context_buffer_priv_segment,
gr_ga10b_find_priv_offset_in_buffer to detect the register layout and
calculate the register offset accordingly.
Bug 200737000
Bug 3532165
Change-Id: I305509cf89498cb0c2c5bfa1d867272bdf5f42b3
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2665491
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch does the following changes:
- Compiles-out unused error reporting APIs and the related
data structures from safety build. For this purpose, it
introduces the new flag: CONFIG_NVGPU_INTR_DEBUG
- Updates nvgpu_report_err_to_sdl() API with one more argument,
hw_unit_id. This aids in finding whether an error to be reported
is corrected or uncorrected from LUT.
- Triggers SW quiesce, if an uncorrected error is reported to
Safety_Services, in safety build.
- Renames files in cic folder by replacing gv11b with ga10b,
since error reporting for gv11b is not supported in dev-main.
JIRA NVGPU-8002
Change-Id: Ic01e73b0208252abba1f615a2c98d770cdf41ca4
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2668466
Reviewed-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
For SMC mode, userspace is expected to use local indexing
for accessing GPC/FBP specific perf registers where local indexing
refers to indexes localized to a given SMC instance. H/W however expects
logical id based indexing for these registers. Currently, nvgpu driver maintains
a mapping between local <-> logical/physical ids of the GPCs for SMC specific
configurations/instances.
These register accesses are performed by the Debugger/Profiler interfaces and uses regops
for read/writes. In their current state, regops simply validates register addresses and performs
the required operation on them. These registers are currently indexed using local ids
and there is a need to convert them to use logical ids for supporting SMC modes. For non-SMC case
local ids are equivalent to logical ids and hence the conversion would have no effect on them.
Following changes are added to facilitate the above conversion from
local ids to logical ids in the regops path.
1) nvgpu_profiler_allowlist_range_search is modified to update
a nvgpu_pm_resource_register_range_map entry instead of just the
type.
2) added two APIs, one meant for profiler V2 based interfaces
and the other for legacy profiler interface. The logic for
legacy profiler interface extends into the more generic profiler
V2 logic to help retain future compatibility. These APIs are added
just after the validation stage for nvgpu_exec_regops.
3) The above APIs return an error if the local ids exceed the number
of GPCs/FBPs for a particular instance.
Bug 200712091
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Change-Id: I060c2408a798f2f4e058aba266fa1ea9cebc2682
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2644956
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
GVS: Gerrit_Virtual_Submit