Currently the vGPU engine management rewrites a lot of the common
device agnostic engine management code.
With the new top HAL parsing one device at a time, it is now easier
to tie the vGPU into the new common device framework by implementing
the top HAL but with the vGPU engine list backend.
This lets the vGPU inherit all the common engine and device
management code. By doing so, the vGPU need only implement a
trivial HAL.
This also gets us a step closer to merging all of the CE init
code: logically it just iterates through all CE engines whatever
they may be. The only reason this differs between chips is because
of the swap from CE0-2 to LCEs in the Pascal generation. This could
be abstracted by the unit code easily enough.
Also, the pbdma_id for each engine has to be added to the device
struct. Eventually this was going to happen anyway, since the
device struct will soon replace the nvgpu_engine_info struct.
It's a little bit of an abuse but might be worth it long term. If
not, it should not be difficult to replace uses of dev->pbdma_id
with a proper lookup of PBDMA ID based on the device info.
JIRA NVGPU-5421
Change-Id: Ie8dcd3b0150184d58ca0f78940c2e7ca72994e64
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2351877
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add tu104 specific HAL tu104_gr_falcon_ctrl_ctxsw() that processes below
CTXSW methods to start/stop SMPC global mode :
NVGPU_GR_FALCON_METHOD_START_SMPC_GLOBAL_MODE
NVGPU_GR_FALCON_METHOD_STOP_SMPC_GLOBAL_MODE
Add new tu104 specific HAL tu104_gr_update_smpc_global_mode() to trigger
SMPC global mode start/stop using gops.gr.falcon.ctrl_ctxsw().
Update nvgpu_dbg_gpu_ioctl_smpc_ctxsw_mode() to enable/disable SMPC
global mode if channel is not bound to debug session.
Bug 2510974
Bug 2257799
Jira NVGPU-5360
Change-Id: I1f9d8f2a2d30a4738f291db3fc72c400d24f4048
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2368696
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The current PM resource reservation system is limited to HWPM resources
only, and reservation tracking is done using boolean variables.
Upcoming profiler support requires reservation for all the PM
resources, like SMPC and PMA stream. Using boolean variables does
not scale and is confusing. Plus the variables have to be replicated
on gpu server in case of virtualization.
Remove flag tracking mechanism and use list based approach to track
all PM reservations. Also, current HALs are defined on debugger object.
Implement new HALs in new pm_reservation object since it is really an
independent functionality.
Add new source file common/profiler/pm_reservation.c which implements
functions to reserve/release resources and to check if any resource
is reserved or not.
Add common/vgpu/pm_reservation_vgpu.c for vGPU which simply forwards
the request to gpu server.
Define new HAL object gops.pm_reservation and assign above functions
to below respective HALs :
g->ops.pm_reservation.acquire()
g->ops.pm_reservation.release()
g->ops.pm_reservation.release_all_per_vmid()
Last HAL above is only used for gpu server cleanup of guest OS.
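
As a rough illustration of list based tracking, the sketch below keeps
one list entry per reserved resource. All names (pm_acquire(),
pm_release(), struct pm_entry, the enum values) are hypothetical and are
not the nvgpu implementation. The policy shown, one exclusive owner per
resource, is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resource types; the text names HWPM, SMPC and PMA stream. */
enum pm_resource { PM_RES_HWPM, PM_RES_SMPC, PM_RES_PMA_STREAM };

/* One list entry per active reservation. */
struct pm_entry {
	enum pm_resource res;
	unsigned int owner_id;
	struct pm_entry *next;
};

struct pm_reservations {
	struct pm_entry *head;
};

/* Reserve a resource; fails with -EBUSY if anyone already holds it. */
static int pm_acquire(struct pm_reservations *r, enum pm_resource res,
		      unsigned int owner_id)
{
	struct pm_entry *e;

	for (e = r->head; e != NULL; e = e->next) {
		if (e->res == res) {
			return -EBUSY;
		}
	}

	e = malloc(sizeof(*e));
	if (e == NULL) {
		return -ENOMEM;
	}
	e->res = res;
	e->owner_id = owner_id;
	e->next = r->head;
	r->head = e;
	return 0;
}

/* Release a reservation; only the owner that acquired it may release it. */
static int pm_release(struct pm_reservations *r, enum pm_resource res,
		      unsigned int owner_id)
{
	struct pm_entry **pp;

	for (pp = &r->head; *pp != NULL; pp = &(*pp)->next) {
		struct pm_entry *e = *pp;

		if (e->res == res && e->owner_id == owner_id) {
			*pp = e->next;
			free(e);
			return 0;
		}
	}
	return -EINVAL;
}
```

Adding a new resource type in such a scheme only needs a new enum value,
whereas flag based tracking needs a new variable per resource.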
Add below new common profiler functions that act as APIs to reserve/
release resources for rest of the units in nvgpu.
nvgpu_profiler_pm_resource_reserve()
nvgpu_profiler_pm_resource_release()
Initialize the metadata required for the reservation system in
nvgpu_pm_reservation_init() and call it during nvgpu_finalize_poweron.
Clean up the metadata before releasing struct gk20a.
Delete below HALs :
g->ops.debugger.check_and_set_global_reservation()
g->ops.debugger.check_and_set_context_reservation()
g->ops.debugger.release_profiler_reservation()
Bug 2510974
Jira NVGPU-5360
Change-Id: I4d9f89c58c791b3b2e63099a8a603462e5319222
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2367224
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
1) In MIG mode, 2D, 3D, I2M and ZBC classes are not supported by
GR engine. NvGpu shall expose the HWCaps through
"struct nvgpu_gpu_characteristics".
2) NvGpu shall expose the following MIG related new caps through
"struct nvgpu_gpu_characteristics".
* mig_enabled - Flag to indicate whether MIG is enabled/disabled.
* gpu_instance_id - GPU instance ID.
* gr_instance_id - graphics execution unit id.
* gr_sys_pipe_id - Sys pipe id of GR engine.
3) populate num_ppc_per_gpc - Pixel Processing cluster per GPC
4) populate max_veid_count_per_tsg - Maximum veid count per TSG
5) populate num_sub_partition_per_fbpa - Sub partition per FBPA.
JIRA NVGPU-5762
Change-Id: I06b5bcd3f568eb0b9c78c8fc6ce155b39aaeaba5
Signed-off-by: lm <lm@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2352100
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
TEGRA_VGPU_CMD_GET_ATTRIBUTE
TEGRA_VGPU_CMD_CHANNEL_FREE_GR_PATCH_CTX
TEGRA_VGPU_CMD_CHANNEL_UNMAP_GR_GLOBAL_CTX
TEGRA_VGPU_CMD_CHANNEL_SET_PRIORITY
TEGRA_VGPU_CMD_CHANNEL_SET_RUNLIST_INTERLEAVE
TEGRA_VGPU_CMD_CHANNEL_SET_TIMESLICE
TEGRA_VGPU_CMD_CHANNEL_FREE_HWPM_CTX
The above commands which are not used by clients anymore are being
removed.
Jira GVSCI-5155
Signed-off-by: Richard Zhao <rizhao@nvidia.com>
Change-Id: If5eef090308e6471a0e7aadf78878f1ad798ee9a
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2367556
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Unify the job metadata handling by deleting the parts that have handled
dynamically allocated job structs and fences. Now a channel can be in
one less mode than before which reduces branching in tricky places and
makes the submit/cleanup sequence easier to understand.
While preallocating all the resources upfront may increase average
memory consumption by some kilobytes, users of channels have to supply
the worst case numbers anyway and this preallocation has been already
done on deterministic channels.
Flip the channel_joblist_delete() call in nvgpu_channel_clean_up_jobs()
to be done after nvgpu_channel_free_job(). Deleting from the list (which
is a ringbuffer) makes it possible to reuse the job again, so the job
must be freed before that. The comment about using post_fence is no
longer valid; nvgpu_channel_abort() does not use fences.
This inverse order has not posed problems before because it's been buggy
only for deterministic channels, and such channels do not do the cleanup
asynchronously so no races are possible. With preallocated job list for
all channels, this would have become a problem.
Jira NVGPU-5492
Change-Id: I085066b0c9c2475e38be885a275d7be629725d64
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2346064
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Commit bd1ae5c9e1 ("gpu: nvgpu: fix MISRA 17.7 violations in mm") did
a seemingly harmless looking change in the cmpxchg() wrapper macro to
convert from atomic_compare_exchange_strong() to nvgpu_atomic_cmpxchg().
The latter is ultimately a wrapper for the former but the semantics are
different: the former takes old as a pointer and updates it for the read
value, while the latter takes it as a value and returns the read value.
The commit caused cmpxchg() to always return the old value, so a failing
compare has never been detected in a year and a half.
This cmpxchg() is used only in the lockless allocator which is used only
in the fence code in deterministic kernel submits which hasn't been part
of safe code, so the broken code has been basically not used. (The
typecast from an integer pointer to an atomic pointer is a separate
concern.)
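
The semantic difference can be illustrated with a small C11 sketch. The
helpers cmpxchg_value() and cmpxchg_broken() are hypothetical and are not
the nvgpu macros; they only mimic the two behaviours described above,
assuming a plain atomic_int.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Value-returning compare-and-exchange built on the C11 primitive.
 * atomic_compare_exchange_strong() writes the value it read back into
 * 'expected' on failure, so returning 'expected' yields the read value
 * in both the success and the failure case.
 */
static int cmpxchg_value(atomic_int *v, int old, int new_val)
{
	int expected = old;

	(void)atomic_compare_exchange_strong(v, &expected, new_val);
	return expected;
}

/*
 * Illustration of the bug class: the caller's 'old' is returned no
 * matter what was actually read, so a failed compare is
 * indistinguishable from a successful one.
 */
static int cmpxchg_broken(atomic_int *v, int old, int new_val)
{
	(void)cmpxchg_value(v, old, new_val);
	return old;
}
```

A caller that checks "return value == old" to detect success sees every
call to the second variant as successful, even when the stored value was
left untouched.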
Jira NVGPU-5493
Change-Id: I932a69c6c185783c0e514e848e0ee6057ce74888
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2368118
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
On Linux, the nvgpu mapping ioctl provides an option to specify the
access type flags for the mapping. This support is not implemented for
other OSes. So that nvrm_gpu knows when to set these flags, add a new
enabled flag *_MAP_ACCESS_TYPE that is enabled only for Linux.
Bug 200621157
Change-Id: If1397bb0d5fdc5589458d92f24647afa586af1c2
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2363829
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Until now, all userspace buffers were mapped in the GMMU as Read & Write
(RW) by default. In order to enable the use cases which require the GPU
to only read the SYSMEM buffers and not inadvertently write to those,
map buffer ioctls need to provide interface to set the mapping access
type from the userspace.
Some of the use cases are:
1. A third party server process exposes shared memory that is
read-only to the client process, which does the GPU processing.
Registering this memory using cudaHostRegister API as read-only
in the client process will restrict the access to Read Only type
from the GPU.
2. IO devices exposing streaming read-only data for processing by
the GPU.
3. For marking semantically read-only data as actually read-only
for the purposes of debugging data corruption.
This patch introduces new AS buffer mapping bitmask flag and
corresponding core VM mapping bitmask flag for representing
Read Only (RO) access type. By default, the access is set
as Read Write (RW).
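
A minimal sketch of how a userspace flag could be translated into a core
VM flag and then into an access type is shown below. The macro names and
bit positions are hypothetical, since the message does not list them;
only the default-to-RW behaviour is taken from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace-facing AS map flag; real bit position unknown. */
#define EXAMPLE_AS_MAP_FLAG_ACCESS_RO	(UINT32_C(1) << 0)
/* Hypothetical core VM map flag; real bit position unknown. */
#define EXAMPLE_VM_MAP_FLAG_ACCESS_RO	(UINT32_C(1) << 4)

enum example_rw_attr { EXAMPLE_MAP_RW, EXAMPLE_MAP_RO };

/* Translate the ioctl-level flags into core VM flags. */
static uint32_t example_translate_as_flags(uint32_t as_flags)
{
	uint32_t core_flags = 0U;

	if ((as_flags & EXAMPLE_AS_MAP_FLAG_ACCESS_RO) != 0U) {
		core_flags |= EXAMPLE_VM_MAP_FLAG_ACCESS_RO;
	}
	return core_flags;
}

/* Pick the GMMU access type; RW remains the default when no flag is set. */
static enum example_rw_attr example_rw_attr(uint32_t core_flags)
{
	return ((core_flags & EXAMPLE_VM_MAP_FLAG_ACCESS_RO) != 0U) ?
		EXAMPLE_MAP_RO : EXAMPLE_MAP_RW;
}
```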
Bug 200621157
Change-Id: I5ec9dec3ce089e577b86c43003d92b61eee4a90b
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2361750
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Below APIs to update hwpm/smpc ctxsw mode take a channel pointer as a
parameter. APIs then extract corresponding TSG from channel and perform
various operations on context stored in TSG.
g->ops.gr.update_smpc_ctxsw_mode()
g->ops.gr.update_hwpm_ctxsw_mode()
Update both above APIs to accept TSG pointer instead of a channel.
This is a refactor work to support new profiler design where a profiler
object is bound to TSG and keeps track of TSG only.
Bug 2510974
Jira NVGPU-5360
Change-Id: Ia4cefda503d8420f2bd32d07c57534924f0f557a
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2366122
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Move profiler object allocation/free APIs to separate profiler
specific file common/profiler.c.
Store struct gk20a pointer in struct dbg_profiler_object_data for
convenience of accessing global struct pointer.
Update profiler object to store TSG pointer instead of channel
pointer, since the expectation is to have one profiler object
per context/TSG.
nvgpu_profiler_reserve_acquire() has a case to check if resource
reservation is acquired by some other channel in TSG.
But now since we keep track of TSG itself, this case becomes
redundant and can be removed.
All the support is compiled out of safety build with compile
flag CONFIG_NVGPU_PROFILER.
Linux will always compile the support.
Bug 2510974
Change-Id: I197bbd67a9cdd1fbea42f1effd1b74b15a6068e5
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2365674
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Quad type reg_ops were only needed on Kepler, and not on any chip
from Maxwell onwards.
HAL g->ops.gr.access_smpc_reg() was incorrectly set for Volta and Turing
whereas it was only applicable to Kepler. Delete it.
There is no register in the quad type whitelist since the type itself is
not supported anymore. Remove the empty whitelists for all chips and
also delete below HALs:
g->ops.regops.get_qctl_whitelist()
g->ops.regops.get_qctl_whitelist_count()
hal/regops/regops_gv100.* files are not used anymore. Delete the files
instead of just deleting quad HALs in these files.
Bug 200628391
Change-Id: I4dcc04bef5c24eb4d63d913f492a8c00543163a2
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2366035
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch implements nvgpu_assert() as a macro so that it prints
information about the calling function, specifically the function
name and the line number.
This patch introduces misra violations (misra_c_2012_rule_10_1_violation)
in nvgpu_assert(). However, leaving misra violations unfixed has low
safety impact since misra violations are coming after fatal error is
hit where GPU driver is not expected to be serviceable thereafter.
Further, this patch provides a debug benefit in quickly finding the
function that led to the exit of the NvGPU process.
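
The reason a macro is needed can be seen in the sketch below:
__func__ and __LINE__ are expanded at the call site, whereas inside a
regular function they would always describe the assert helper itself.
The names example_assert(), example_assert_fail() and
example_format_assert() are hypothetical and do not reflect the actual
nvgpu_assert() definition.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Build the diagnostic text; split out so it can be checked on its own. */
static int example_format_assert(char *buf, size_t len,
				 const char *func, int line)
{
	return snprintf(buf, len, "assert failed in %s():%d", func, line);
}

/* Report the caller's location and terminate the process. */
static void example_assert_fail(const char *func, int line)
{
	char msg[128];

	(void)example_format_assert(msg, sizeof(msg), func, line);
	fprintf(stderr, "%s\n", msg);
	abort();
}

/* Expanded in the caller, so __func__/__LINE__ name the failing site. */
#define example_assert(cond) \
	((cond) ? (void)0 : example_assert_fail(__func__, __LINE__))
```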
Bug 2964898
Change-Id: Iba85f4a9226742a0bb08b045bcbfa26949bbe746
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2342086
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add the following gr gops functions:
- enable_gpc_crop_hww
- enable_gpc_zrop_hww
- handle_gpc_crop_hww
- handle_gpc_zrop_hww
- handle_gpc_rrh_hww
These gr gops will be used in nvgpu-next.
Add function: nvgpu_gr_rop_offset to compute rop pri offsets.
Jira: NVGPU-5237
Change-Id: I9e2437c1d2893238b16ec7a134543e20c81b49f7
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2335687
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The FIFO pbdma map is an array of bit maps that link PBDMAs to runlists.
This array allows other software to query what PBDMA(s) serves a given
runlist. The PBDMA map is read verbatim from an array of host registers.
These registers are stored in a kmalloc()'ed array.
This causes a problem for the device management code. The device
management initialization executes well before the rest of the FIFO
PBDMA initialization occurs. Thus, if the device management code
queries the PBDMA mapping for a given device/runlist, the mapping has
yet to be populated.
In the next patches in this series the engine management code is subsumed
into the device management code. In other words the device struct is
reused by the engine management and all host SW does is pull pointers to
the host managed devices from the device manager. This means that all
engine initialization that used to be done on top of the device
management needs to move to the device code.
So, long story short, the PBDMA map needs to be read from the registers
directly, instead of an array that gets allocated long after the device
code has run.
This patch removes the pbdma map array, deletes two HALs that managed
that, and instead provides a new HAL to query this map directly from
the registers so that the device code can use it.
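
A sketch of querying the map directly, rather than from a cached array,
could look like the following. It assumes one register per PBDMA whose
bits select the runlists that PBDMA serves; the register values,
NUM_PBDMAS and the helper names are made up for illustration, and
read_pbdma_map() stands in for a real register read.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PBDMAS	4U

/* Stand-in for the host registers; values are invented for the example. */
static const uint32_t fake_pbdma_map_regs[NUM_PBDMAS] = {
	0x1U, 0x1U, 0x2U, 0x4U
};

/* Hypothetical register accessor: bitmask of runlists served by a PBDMA. */
static uint32_t read_pbdma_map(uint32_t pbdma_id)
{
	return fake_pbdma_map_regs[pbdma_id];
}

/* Return a bitmask of the PBDMAs that serve the given runlist. */
static uint32_t runlist_pbdma_mask(uint32_t runlist_id)
{
	uint32_t mask = 0U;
	uint32_t i;

	for (i = 0U; i < NUM_PBDMAS; i++) {
		if ((read_pbdma_map(i) & (UINT32_C(1) << runlist_id)) != 0U) {
			mask |= UINT32_C(1) << i;
		}
	}
	return mask;
}
```

Because nothing is cached, a lookup like this has no dependency on any
allocation done later in FIFO initialization.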
JIRA NVGPU-5421
Change-Id: I5966d440903faee640e3b41494d2caf4cd177b6d
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2361134
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Adjust documentation and validity checks in the fence functions for
simplicity.
Now that the cde code is using user fences cleanly, the
do-nothing-on-null action can cause unintended behaviour in new code
using nvgpu_fence_get and nvgpu_fence_put. It does not make sense to
call these with a null fence, so delete the checks.
Extend the documentation in nvgpu_fence_extract_user() for the os fence
lifetime to give a reason for the dup call.
Make nvgpu_fence_from_semaphore() and nvgpu_fence_from_syncpt() return
void. These fill a previously allocated object; the only failure would
have been a null object, but that never happens and is not acceptable
behaviour for callers so delete these null checks and fix types.
Jira NVGPU-5248
Change-Id: I9f82365d50ab5600374c8f7dd513691eac14a2f1
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2359624
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The stored fence in struct gk20a_buffer_state is a post fence of a
previous cde preparation job, if any. This stored fence is passed to
userspace via NVGPU_GPU_IOCTL_PREPARE_COMPRESSIBLE_READ in case a
preparation job was necessary to fulfill the request. As nothing else is
needed from the fence, make it just a struct nvgpu_user_fence.
Add nvgpu_user_fence_clone() for copying this user fence because it's
stored internally and returned to userspace. The refcounted os fence
needs special care. Now that the API is not so trivial anymore, add some
documentation.
Jira NVGPU-5248
Jira NVGPU-5493
Change-Id: I8bc4d52eaab7c7cbc5573b331e72e1d853f9f057
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2359065
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
This patch removes the reporting of _ECC_CORRECTED errors which are
not applicable to GV11B. Specifically, this patch removes the code
related to the reporting of the following service IDs:
NVGUARD_SERVICE_IGPU_SM_SWERR_LRF_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_CBU_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_PMU_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_GPCCS_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_FECS_SWERR_FALCON_DMEM_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_GCC_SWERR_L15_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_MMU_SWERR_L1TLB_FA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_MMU_SWERR_L1TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_L2TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_TLB_SA_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_PTE_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_HUBMMU_SWERR_PDE0_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L0_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L0_PREDECODE_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L1_DATA_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_ICACHE_L1_PREDECODE_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_TAG_MISS_FIFO_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_SM_SWERR_L1_TAG_S2R_PIXPRF_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_TSTG_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_RSTG_ECC_CORRECTED
NVGUARD_SERVICE_IGPU_LTC_SWERR_CACHE_DSTG_BE_ECC_CORRECTED
Bug 200616002
Change-Id: I199c396f9f6a6be007bd6d3c556199b5a73c3c91
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2349587
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Decouple the fence information needed for providing submit postfences to
userspace by adding a separate type for that and using it to pass fence
data to ioctls.
The data in struct nvgpu_fence_type is used in various places:
- job tracking needs to know when a post fence is expired
- job submitters within the driver (vidmem clears) need to be able to
wait for these fences
- userspace needs the fence as an id, value pair or as a file descriptor
created from an os fence
To keep object lifetimes strict, start decoupling the os fence data out
of struct nvgpu_fence_type: delete nvgpu_fence_install_fd() and add
nvgpu_fence_extract_user() to return a struct nvgpu_user_fence that
contains only the necessary information. Storing the os fence in job
tracking metadata is legacy code and not useful. Passing the os fence
from where it's created through the whole submit path inside this
combined fence type has been convenient, though.
The internally stored cde job fence in dmabuf compression metadata is
still nvgpu_fence_type to keep this patch simple.
Jira NVGPU-5248
Change-Id: I75b7da676fb6aa083828f888c55571bbf7645ef3
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2359064
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
The os fences can currently be constructed from a file descriptor, from
a raw syncpt id/value pair, or a struct nvgpu_semaphore. Each os fence
object has exactly one owner for simplicity as the owner is a wrapper
for a refcounted object. This does not allow copying the fences, so
extend struct nvgpu_os_fence_ops with a member to increment the refcount
of the underlying fence. This can be used to "duplicate" the object. The
copy needs an eventual call to ops->drop_ref() to release the refcount.
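
The ownership model can be sketched as follows. The structures are
simplified stand-ins, not the real nvgpu types, and the member name
dup is an assumption; only drop_ref() is named in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted backend object standing in for the OS fence. */
struct backend_fence {
	int refcount;
};

struct os_fence;

/* Simplified ops table; 'dup' is an assumed name for the new member. */
struct os_fence_ops {
	void (*dup)(struct os_fence *f);	/* take one more reference */
	void (*drop_ref)(struct os_fence *f);	/* release one reference */
};

/* Wrapper with exactly one owner; it holds one reference on 'priv'. */
struct os_fence {
	struct backend_fence *priv;
	const struct os_fence_ops *ops;
};

static void backend_dup(struct os_fence *f)
{
	f->priv->refcount++;
}

static void backend_drop_ref(struct os_fence *f)
{
	f->priv->refcount--;
	f->priv = NULL;
}

static const struct os_fence_ops backend_ops = {
	.dup = backend_dup,
	.drop_ref = backend_drop_ref,
};

/* "Duplicate" a fence: copy the wrapper and take a reference for it. */
static struct os_fence os_fence_clone(const struct os_fence *src)
{
	struct os_fence copy = *src;

	copy.ops->dup(&copy);
	return copy;
}
```

Each wrapper, original or copy, is then responsible for exactly one
drop_ref() call.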
This will be useful to decouple the features of struct nvgpu_fence_type
needed in the kernel and those needed for userspace.
Jira NVGPU-5248
Change-Id: Ie7b943f0851f62842e941a7283b389bac84ae9ae
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2359063
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Nvgpu does not support nested interrupts, and as a result priv/pbus
interrupts do not reach the CPU while other interrupts on the intr_0
(stall) tree are being processed. This issue is not specific to
priv/pbus, but since pbus errors are critical, it is important to
detect them early on.
Below is the snippet from one of the failing logs where nvgpu
is doing recovery to process gr interrupt.
Right after GR engine is reset (PGRAPH of PMC_ENABLE), failing priv
accesses should have triggered pbus interrupt but it does not reach cpu
until gr interrupt is handled. Any interrupt that requires recovery will
take longer to finish isr as recovery is done as part of isr.
Also intr_0 (stall) interrupts are paused while stall interrupt is being
processed.
gm20b_gr_falcon_bind_instblk:147 [ERR] arbiter idle timeout, status: badf1020
gm20b_gr_falcon_wait_for_fecs_arb_idle:125 [ERR] arbiter idle timeout, fecs ctxsw status: 0xbadf1020
The fix to detect pbus intr while other stall interrupts are being
processed is to move pbus intr enable/disable/clear/handle to the
nonstall (intr_1) tree. Configure pbus_intr_en_1 to route pbus to the
nonstall tree.
Priv interrupts cannot be moved to nonstall (intr_1) tree due
to h/w not supporting this.
In Turing, moving pbus intr to nonstall is not feasible as mc_intr(1)
tree is deprecated. Add Turing specific stall intr handler hals with
original logic to route pbus intr to mc_intr(0).
JIRA NVGPU-25
Bug 200603566
Change-Id: I36fc376800802f20a0ea581b4f787bcc6c73ec7e
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2354192
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Add a generic profiler based on the channel kickoff profiler. This
aims to provide a mechanism to allow engineers to (more) easily profile
arbitrary software paths within nvgpu.
Usage of this profiler is still primarily through debugfs. Next up is
a generic debugfs interface for this profiler in the Linux code.
The end goal for this is to profile the recovery code and generate
interesting statistics.
JIRA NVGPU-5606
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Change-Id: I99783ec7e5143855845bde4e98760ff43350456d
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2355319
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This adds a new device management unit in the common code responsible
for facilitating the parsing of the GPU top device list and providing
that info to other units in nvgpu.
The basic idea is to read this list once from HW and store it in a
set of lists corresponding to each device type (graphics, LCE, etc).
Many of the HALs in top can be deleted and instead implemented using
common code parsing the SW representation.
Every time the driver queries the device list it does so using a
device type and instance ID. This is common code. The HAL is responsible
for populating the device list in such a way that the driver can
query it in a chip agnostic manner.
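
A chip agnostic query of this kind might look like the sketch below.
For brevity it searches one flat table instead of per-type lists, and
the struct layout, enum values and function name are hypothetical rather
than the actual nvgpu definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device types; the text mentions graphics and LCE. */
enum example_dev_type { EXAMPLE_DEV_GRAPHICS, EXAMPLE_DEV_LCE };

/* Simplified SW representation of one parsed device list entry. */
struct example_device {
	enum example_dev_type type;
	uint32_t inst_id;
	uint32_t engine_id;
};

/* Look up a device by type and instance ID; NULL if it does not exist. */
static const struct example_device *
example_device_get(const struct example_device *devs, size_t count,
		   enum example_dev_type type, uint32_t inst_id)
{
	size_t i;

	for (i = 0U; i < count; i++) {
		if (devs[i].type == type && devs[i].inst_id == inst_id) {
			return &devs[i];
		}
	}
	return NULL;
}
```

The chip specific part is reduced to filling the table; every caller
uses the same lookup.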
Also delete some of the unit tests for functions that no longer
exist. This code will require new unit tests in time; those should be
quite simple to write once unit testing is needed.
JIRA NVGPU-5421
Change-Id: Ie41cd255404b90ae0376098a2d6e9f9abdd3f5ea
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2319649
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The valid flag in struct nvgpu_fence_type is not very useful. It's set
when a fence is created on an allocated object and read in these three
scenarios:
- nvgpu_fence_install_fd() after a submit, if the submit was successful.
A successful submit implies that a post fence exists.
- nvgpu_fence_wait() for a copyengine job when synchronizing the ce
ringbuffer or when waiting for vidmem clears. In these cases the fence
is also clearly always valid.
- nvgpu_fence_is_expired() when testing whether a tracked job has
completed. Such jobs cannot exist without post fences that are
mandatory for tracking, so the fence must exist.
Remove the valid flag. Remove also the other init checks from the above
functions; they're equally unused and confusing implying that such calls
would be acceptable, causing sloppy code at best.
Jira NVGPU-5248
Jira NVGPU-5493
Change-Id: I52c5be1569b343024d2626bd9577f87b46064fba
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2357828
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The differences between sync_fence ("android sync") and dma_fence are
abstracted away by nvhost in the nvhost_fence interface. There is no
need to have separate android and dma os fences for syncpoints; unify
the general implementation so that it's always used when requested for
the build.
Jira NVGPU-5386
Change-Id: Ia829e93e18d03064ff46ab1271547de2d1fb1cae
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2356158
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Implement empty stubs of the channel watchdog functions for when
watchdog is disabled from build. Add some forward declarations that were
missing. Now most call sites don't need #ifdefs for the build flag.
Add error checks for the wdt alloc failure.
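
The stub pattern is roughly the one sketched below. The flag
CONFIG_EXAMPLE_WDT and the function names are placeholders, not the
real nvgpu symbols; the point is only that callers compile unchanged
whether or not the feature is built in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct channel;	/* forward declaration is enough for the prototypes */

#ifdef CONFIG_EXAMPLE_WDT
/* Real implementations live elsewhere when the feature is enabled. */
int channel_wdt_start(struct channel *ch);
bool channel_wdt_enabled(const struct channel *ch);
#else
/* Empty stubs: calls compile away when the feature is disabled. */
static inline int channel_wdt_start(struct channel *ch)
{
	(void)ch;
	return 0;
}

static inline bool channel_wdt_enabled(const struct channel *ch)
{
	(void)ch;
	return false;
}
#endif

/* A call site needs no conditional compilation of its own. */
static int submit(struct channel *ch)
{
	return channel_wdt_start(ch);
}
```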
Jira NVGPU-5494
Jira NVGPU-5493
Change-Id: I2d42e8ab4c5e045cd280b2e1f254396127bd154b
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2352050
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>