Currently the vGPU engine management rewrites a lot of the common
device agnostic engine management code.
With the new top HAL parsing one device at a time, it is now more
easily possible to tie the vGPU into the new common device framework
by implementing the top HAL but with the vGPU engine list backend.
This lets the vGPU inherit all the common engine and device
management code. By doing so the vGPU HAL need only implement a
trivial and simple HAL.
This also gets us a step closer to merging all of the CE init
code: logically it just iterates through all CE engines whatever
they may be. The only reason this differs between chips is because
of the swap from CE0-2 to LCEs in the Pascal generation. This could
be abstracted by the unit code easily enough.
Also, the pbdma_id for each engine has to be added to the device
struct. Eventually this was going to happen anyway, since the
device struct will soon replace the nvgpu_engine_info struct.
It's a little bit of an abuse but might be worth it long term. If
not, it should not be difficult to replace uses of dev->pbdma_id
with a proper lookup of PBDMA ID based on the device info.
JIRA NVGPU-5421
Change-Id: Ie8dcd3b0150184d58ca0f78940c2e7ca72994e64
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2351877
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Implement empty stubs of the channel watchdog functions for when
watchdog is disabled from build. Add some forward declarations that were
missing. Now most call sites don't need #idefs for the build flag.
Add error checks for the wdt alloc failure.
Jira NVGPU-5494
Jira NVGPU-5493
Change-Id: I2d42e8ab4c5e045cd280b2e1f254396127bd154b
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2352050
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Preempt TSG occurs in non-mission mode, when unbinding channel
from TSG, or aborting TSG. Should a preempt not complete on
engine, we expect other HW safety mechanisms such as FECS
watchdog to detect issues that prevented saving current context.
Add BUG_ON when attempting to recover from preempt timeout,
to make sure we got such error, and sw_quiesce has been
requested.
Jira NVGPU-4230
Change-Id: Ia26a61e703f74eb28d29e72e75664ca4ec97a586
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2265082
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Runlist update occurs in non-mission mode, when
adding/removing channel/TSGs. The pending bit
is a debug only feature. As a result logging a
warning is sufficient.
We expect other HW safety mechanisms such as
PBDMA timeout to detect issues that caused pending
to not clear. It's possible bad base address could
cause some MMU faults too.
Worst case we rely on the application level task
monitor to detect the GPU tasks are not completing
on time.
Jira NVGPU-4322
Change-Id: I7233770349db5dfad6904170a1e9a2d5eada70b2
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2265094
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vinod Gopalakrishnakurup <vinodg@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
For MMU and PBDMA faults, error notifier needs to be set
before entering SW quiesce. Otherwise it ends up with
default NVGPU_ERR_NOTIFIER_FIFO_ERROR_IDLE_TIMEOUT.
Added nvgpu_rc_mmu_fault to:
- call g->ops.fifo.recover when recovery is enabled
- set MMU error when recovery is disabled
Updated nvgpu_rc_pbdma_fault to set PBDMA error when
recovery is disabled as well.
Wait for deferred interrupts to complete before actually
entering SW quiesce state, to make sure error notifier has
been set.
Jira NVGPU-4127
Change-Id: Ia84c723e021e397391c6c609d4bb96c06afdcc47
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2210909
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Advisory Rule 2.7 states that there should be no unused
parameters in functions.
This patch removes unused function parameters from the following:
* nvgpu_channel_ctxsw_timeout_debug_dump_state()
* nvgpu_channel_destroy()
* nvgpu_tsg_destroy()
* nvgpu_rc_pdbma_fault()
Jira NVGPU-3178
Change-Id: I12ad0d287fd7980533663a9776428ef5d4fd1fb9
Signed-off-by: Scott Long <scottl@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2176066
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Some recovery functions are currently exported in libnvgpu_safe.export.
Once CONFIG_NVGPU_RECOVERY is permanently disabled for safety build,
we can remove those functions from the export file.
Until we can disable it, make sure that related functions do exist,
even when CONFIG_NVGPU_RECOVERY is disabled, by using #ifdefs inside
the functions, instead of redeclaring functions as static inline.
Jira NVGPU-3871
Change-Id: Ib682ae81268b35cd1050a55cc73653fb6637b87c
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2170433
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
nvgpu_rc_pbdma_fault just checks for the id and id_type from struct
nvgpu_pbdma_status_info. These contain invalid values during chsw_load
and chsw_switch. This patch corrects the above bug by checking for the
chsw status and then loading the values for id and type.
The current code reads the pbdma_status info after clearing the
interrupt. Other interrupts can cause enough delay between clearing the
interrupt and pbdma switching the channel leading to invalid channel/tsg
ID. Correct that by reading the pbdma_status info register before
clearing of the pbdma interrupt to correctly read the context
information before the pbdma can switch out the context.
Bug 2648298
Change-Id: Ic2f0682526e00d14ad58f0411472f34388183f2b
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2165047
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Remove below hals since the corresponding functions are same on all
platforms and they are h/w independent
g->ops.gr.enable_ctxsw()
g->ops.gr.disable_ctxsw()
g->ops.gr.reset()
Call the functions directly at all places
Remove CONFIG_NVGPU_DEBUGGER from places where these functions are
called since they are not debugger dependent
This also helps to disable CONFIG_NVGPU_DEBUGGER and to keep recovery
sequence intact
Jira NVGPU-3506
Change-Id: Id2b208ca23dc4667e78edcd8ad242a8558e0ff64
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2137255
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vinod Gopalakrishnakurup <vinodg@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Move chip specific recovery code for volta onwards
architecture to hal/rc/rc_gv11b.c
Rename
fifo.teardown_ch_tsg -> fifo.recover
gk20a_runlist_update_locked -> nvgpu_runlist_update_locked
Remove
Unused h/w headers from fifo_gv11b.c
Use local variable f instead of g->fifo
JIRA NVGPU-1314
Change-Id: Ia535bbe4780e7241fdd911a8f577c6b98cf0fe53
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2102897
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Basic units like fifo, rc are having dependency on
gr_falcon. Avoided outside gr units dependency on gr_falcon
by moving following functions to gr:
int nvgpu_gr_falcon_disable_ctxsw(struct gk20a *g,
struct nvgpu_gr_falcon *falcon); ->
int nvgpu_gr_disable_ctxsw(struct gk20a *g);
int nvgpu_gr_falcon_enable_ctxsw(struct gk20a *g,
struct nvgpu_gr_falcon *falcon); ->
int nvgpu_gr_enable_ctxsw(struct gk20a *g);
int nvgpu_gr_falcon_halt_pipe(struct gk20a *g); ->
int nvgpu_gr_halt_pipe(struct gk20a *g);
HALs also moved accordingly and updated code to reflect this.
Also moved following data back to gr from gr_falcon:
struct nvgpu_mutex ctxsw_disable_mutex;
int ctxsw_disable_count;
JIRA NVGPU-3168
Change-Id: I2bdd4a646b6f87df4c835638fc83c061acf4051e
Signed-off-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2100009
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Removed unused struct from gr_gk20a.h
Change static allocation for struct gr_gk20a to dynamic type.
Change all the files that being affected by that change.
Call gr allocation from corresponding init_support functions, which
are part of the probe functions.
nvgpu_pci_init_support in pci.c
vgpu_init_support in vgpu_linux.c
gk20a_init_support in module.c
Call gr free before the gk20a free call in nvgpu_free_gk20a.
Rename struct gr_gk20a to struct nvgpu_gr
JIRA NVGPU-3132
Change-Id: Ief5e664521f141c7378c4044ed0df5f03ba06fca
Signed-off-by: Vinod G <vinodg@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2095798
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Delete apply_ctxsw_timeout_intr ops and add
ctxsw_timeout_enable ops
Move chip specific sched_error and ctxsw_timeout
functions to hal/fifo/fifo_intr_* and hal/fifo/ctxsw_timeout_*
Add nvgpu_rc_ctxsw_timeout function under common/rc/rc.c
Do not check ctxsw timeout for channels that are no more
bound to tsg.
JIRA NVGPU-1312
Change-Id: Ide977fb60b3b72a27d9f22873f7a416c3bd1181d
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2075734
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>