This patch does the following changes:
- Compiles-out unused error reporting APIs and the related
data structures from safety build. For this purpose, it
introduces the new flag: CONFIG_NVGPU_INTR_DEBUG
- Updates nvgpu_report_err_to_sdl() API with one more argument,
hw_unit_id. This aids in finding whether an error to be reported
is corrected or uncorrected from LUT.
- Triggers SW quiesce, if an uncorrected error is reported to
Safety_Services, in safety build.
- Renames files in cic folder by replacing gv11b with ga10b,
since error reporting for gv11b is not supported in dev-main.
JIRA NVGPU-8002
Change-Id: Ic01e73b0208252abba1f615a2c98d770cdf41ca4
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2668466
Reviewed-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
GVS: Gerrit_Virtual_Submit
In DRIVE 6.0, NvGPU is allowed to report only 32-bit metadata to
Safety_Services. So, there is no need to have distinct APIs for
reporting errors from units like GR, MM, FIFO to SDL unit. All
these error reporting APIs will be replaced with a single API. To
meet this objective, this patch does the following changes:
- Replaces nvgpu_report_*_err with nvgpu_report_err_to_sdl.
- Removes the reporting of error messages.
- Replaces nvgpu_log() with nvgpu_err(), for error reporting.
- Removes error reporting to Safety_Services from nvgpu_report_*_err.
However, nvgpu_report_*_err APIs and their related files are not
removed. During the creation of nvgpu-mon, they will be moved under
nvgpu-rm, in debug builds.
Note:
- There will be a follow-up patch to fix error IDs.
- As discussed in https://nvbugs/3491596 (comment #12), the high
level expectation is to report only errors.
JIRA NVGPU-7450
Change-Id: I428f2a9043086462754ac36a15edf6094985316f
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2662590
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
- implemented device info cmd to send device info to the gsp for
runlist submission. Currently GSP scheduler support only GR
engine '0' instance.
- implemented runlist submit cmd. GSP firmware will submit the
corresponding runlist by writing into submit registers. This
command is direct replacement of hw_submit ga10b hal for GR engine.
NVGPU-6790
Signed-off-by: Ramesh Mylavarapu <rmylavarapu@nvidia.com>
Change-Id: I5dc573a6ad698fe20b49a3466a8e50b94cae74df
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2608923
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Most of the Orin chip specific code is compiled out of safety build
with CONFIG_NVGPU_NON_FUSA and CONFIG_NVGPU_HAL_NON_FUSA. Remove the
config protection from Orin/GA10B specific code. Currently all code
is enabled. Code not required in safety will be compiled out later
in separate activity.
Other noteworthy changes in this patch related to safety build:
- In ga10b_ce_request_idle(), add a log print to dump num_pce so that
compiler does not complain about unused variable num_pce.
- In ga10b_fifo_ctxsw_timeout_isr(), protect variables active_eng_id and
recover under CONFIG_NVGPU_KERNEL_MODE_SUBMIT to fix compilation
errors of unused variables.
- Compile out HAL gops.pbdma.force_ce_split() from safety since this HAL
is GA100 specific and not required for GA10B.
- Compile out gr_ga100_process_context_buffer_priv_segment() with
CONFIG_NVGPU_DEBUGGER.
- Compile out VAB support with CONFIG_NVGPU_HAL_NON_FUSA.
- In ga10b_gr_intr_handle_sw_method(), protect left_shift_by_2 variable
with appropriate configs to fix unused variable compilation error.
- In ga10b_intr_isr_stall_host2soc_3(), compile ELPG function calls
with CONFIG_NVGPU_POWER_PG.
- In ga10b_pmu_handle_swgen1_irq(), move whole function body under
CONFIG_NVGPU_FALCON_DEBUG to fix unused variable compilation errors.
- Add below TU104 specific files in safety build since some of the code
in those files is required for GA10B. Unnecessary code will be
compiled out later on.
hal/gr/init/gr_init_tu104.c
hal/class/class_tu104.c
hal/mc/mc_tu104.c
hal/fifo/usermode_tu104.c
hal/gr/falcon/gr_falcon_tu104.c
- Compile out GA10B specific debugger/profiler related files from
safety build.
- Disable CONFIG_NVGPU_FALCON_DEBUG from safety debug build temporarily
to work around compilation errors seen with keeping this config
enabled. Config will be re-enabled in safety debug build later.
Jira NVGPU-7276
Change-Id: I35f2489830ac083d52504ca411c3f1d96e72fc48
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2627048
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
On certain platforms, not all copy engine instances are usable. The user
shouldn't submit any work to these engines. To enforce this, remove
these engines from active/host_engine list, this should ensure that these
engines do not get advertised to userspace. In order to accomplish this
introduce the following functions:
- nvgpu_engine_remove_one_dev: This function removes the specified device
entry from following device lists: fifo->host_engines, fifo->active_engines,
runlist->rl_dev_list, runlist->eng_bitmask.
Replace iteration over LCE device type entries using
nvgpu_device_for_each(g, dev, NVGPU_DEVTYPE_LCE), along with this introduce
macro nvgpu_device_for_each_safe.
Introduce gpu_dbg_ce flag for CE debugging.
Bug 3370462
Change-Id: I2e21f18363c6e53630d129da241c8fece106cd33
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2616711
Reviewed-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
The current runlist code assumes a single runlist buffer to hold all TSG
and channel entries. Create separate RL domain and domain memory types
to hold data that is related to only a scheduling domain and not
directly to the runlist hardware; in the future, more than one domains
may exist and one of them is enabled at a time.
The domain is used only internally by the runlist code at this point and
is functionally equivalent to the current runlist memory that houses the
round robin entries.
The double buffering is still kept, although more domains might benefit
from some cleverness. Although any number of created domains may be
edited in runtime, nly one runlist memory is accessed by the hardware at
a time. To spare some contiguous memory, this should be considered an
opportunity for optimization in the future.
Jira NVGPU-6425
Change-Id: Id99c55f058ad56daa48b732240f05b3195debfb1
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2618386
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The macros defined within the C file in the form
(\
fb_mmu_l2tlb_ecc_status_corrected_err_l2tlb_sa_data_m() |\
fb_mmu_l2tlb_ecc_status_corrected_err_l2tlb1_sa_data_m() \
)
are difficult to detect correctly in libclang based static analyzers.
As a consequence, Hal Checker might be missing some coverage.
Such masks are converted into a static function format to help
mitigate this issue.
Change-Id: Id43e25abda8db4c79f7f6fc604eb6e76e9f6282c
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2598063
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
nvgpu_timeout_init() returns an error code only when the flags parameter
is invalid. There are very few possible values for flags, so extract the
two most common cases - cpu clock based and a retry based timeout - to
functions that cannot fail and thus return nothing. Adjust all callers
to use those, simplfying error handling quite a bit.
Change-Id: I985fe7fa988ebbae25601d15cf57fd48eda0c677
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2613833
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
There is HW specific limit on number of channel entries that can be
added for each TSG entry in runlist. Right now there is no checking
to enforce this from SW and hence if User binds more than supported
channels to same TSG, invalid TSG formation error interrupts are
generated.
Fix this by adding appropriate checks in below steps :
- Add new field ch_count to struct nvgpu_tsg to keep track of
channels bound to TSG.
- Define new hal gops.runlist.get_max_channels_per_tsg() to retrieve
HW specific maximum channel count per TSG.
- Implement the HAL for gk20a and gv11b chips, and assign new HALs for
all chips appropriately.
- Increment ch_count while binding the channel to TSG and decrement it
while unbinding.
- While binding channel to TSG, Check if current channel count is
already equal to max channel count. If yes, print an error and bail
out.
Bug 200763991
Change-Id: Ic5f17a52e0fb171d1c020bf4f085f57cdb95f923
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2582095
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: svc_kernel_abi <svc_kernel_abi@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
common.cic unit is divided into common.cic.mon and common.cic.rm
based on rm and mon process split.
CIC-mon subunit includes the code which is utilized in critical
interrupt handling path like initialization, error detection and
error reporting path. CIC-rm subunit includes the code corresponding
to rest of interrupt handling(like collecting error debug data from
registers) and ISR status management (status of deferred interrupts).
Split the CIC APIs and data-members into above two subunits.
JIRA NVGPU-6899
Change-Id: I151b59105ff570607c4a62e974785e9c1323ef69
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2551897
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
common.ptimer unit right now exposes two APIs -
scale_ptimer() to scale the timer
ptimer_scalingfactor10x() to get the scaling factor
receiving scaling factor is not really necessary for user of
common.ptimer since it can be internally calculated in scale_ptimer()
function itself.
Hence make ptimer_scalingfactor10x() static and rename public API
scale_ptimer() to nvgpu_ptimer_scale()
nvgpu_ptimer_scale() will not accept timeout value as parameter
and return scaled timeout value in another pointer parameter.
Error code is returned if timeout value is invalid.
Jira NVGPU-6394
Change-Id: Ib882d99f6096c3af5f96eef298d713fb5e36dd87
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2546970
(cherry picked from commit 2da7c918efe91046818c83481664312e194ead8e)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2551334
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Replace all nvgpu_next functions/structs either by 1) collapsing them
into nvgpu legacy functions/structs 2) renaming them as follows:
- nvgpu_next_*() => nvgpu_(ga10b/ga100)_*()
- nvgpu_next_*() => (ga10b/ga100)_*()
- nvgpu_next_*() => nvgpu_*() [only if this doesn't cause collision]
- nvgpu_next_*() = > nvgpu_*_extra()
Create hal.sim unit and move Ampere+ SIM code into it.
Jira NVGPU-4771
Change-Id: I215594a0d0df4bd663bd875a0d0db47bcb9ff6a2
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2548056
Reviewed-by: svc_kernel_abi <svc_kernel_abi@nvidia.com>
Reviewed-by: Mahantesh Kumbar <mkumbar@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The CONFIG_NVGPU_NEXT config is no longer required now that ga10b and
ga100 sources have been collapsed. However, the ga100, ga10b sources
are not safety certified, so mark them as NON_FUSA by replacing
CONFIG_NVGPU_NEXT with CONFIG_NVGPU_NON_FUSA.
Move CONFIG_NVGPU_MIG to Makefile.linux.config and enable MIG support
by default on standard build.
Jira NVGPU-4771
Change-Id: Idc5861fe71d9d510766cf242c6858e2faf97d7d0
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2547092
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
CIC (Central Interrupt controller) will be responsible for the
interrupt handling. common.cic unit is the placeholder for all
interrupt related code. Move interrupt related defines and
Public APIs present in common.mc to common.cic.
Note: The common.mc interrupts related struct definitions are
not moved as part of this patch.
Adapt the code to use interrupt handling related defines and public
APIs migrated from common.mc to common.cic
JIRA NVGPU-6899
Change-Id: I747e2b556c0dd66d58d74ee5bb36768b9370d276
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2535618
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc_kernel_abi <svc_kernel_abi@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Update the nvgpu_runlist_update_for_channel() function:
- Rename it to nvgpu_runlist_update()
- Have it take a pointer to the runlist to update instead
of a runlist ID. For the most part this makes the code
better but there's a few places where it's worse (for
now).
This starts the slow and painful process of moving away from
the non-runlist code using runlist IDs in many places it should
not.
Most of this patch is just fixing compilation problems with
the minor header updates.
JIRA NVGPU-6425
Change-Id: Id9885fe655d1d750625a1c8aceda9e67a2cbdb7a
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2470304
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Rename struct nvgpu_runlist_info to struct nvgpu_runlist; the
info is not necessary. struct nvgpu_runlist is soon to be a
first class object among the nvgpu object model.
Also rename the fields runlist_info and active_runlist_info to
simply runlists and active_runlists respectively. Again the info
text is just not necessary and somewhat misleading. These structs
_are_ the runlist representations in SW; they are not merely
informational.
Also add an rl_dbg() macro to print debug info specific to
runlist management and some debug prints specifying the runlist
topology for the running chip.
Change-Id: Id9fcbdd1a7227cb5f8c75cca4abbff94fe048e49
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2470303
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The NEXT bit can remain set for the channel if timeslice expires before
scheduler clears it. Due to this nvgpu fails TSG unbind and in turn
nvrm_gpu fails channel close. In this case, checking the channel hw
state after some time can help see NEXT bit cleared by scheduler.
Reenable the tsg and return -EAGAIN to nvrm_gpu for it to retry again.
Bug 3144960
Change-Id: I35f417f02270e371a4e632986b73a00f8a4f921a
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2468391
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
The simulator ring buffer DMA interface supports buffers of the following sizes:
4, 8, 12 and 16K. At present, it is configured to 4K and it happens to match
with the kernel PAGE_SIZE, which is used to wrap back the GET/PUT pointers once
4K is reached. However, this is not always true; for instance, take 64K pages.
Hence, replace PAGE_SIZE with SIM_BFR_SIZE.
Introduce macro NVGPU_CPU_PAGE_SIZE which aliases to PAGE_SIZE and replace
latter with former.
Bug 200658101
Jira NVGPU-6018
Change-Id: I83cc62b87291734015c51f3e5a98173549e065de
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2420728
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Remove current mc hals
- mc.reset()
- mc.enable()
- mc.disable()
- mc.reset_mask()
- mc.reset_engine()
- mc.reset_engine_enable()
Add new mc hals
- mc.enable_units(g, units, enable)
> enable/disable given unit(s)
- mc.enable_dev(g, dev, enable)
> enable/disable engine represented by given device pointer
- mc.enable_devtype(g, devtype)
> enable/disable all engines of given devtype
Move common mc intr functions to common/mc/mc_intr.c.
Add below common mc functions
- nvgpu_mc_reset_units(g, units)
> reset given logical OR of nvgpu unit bitmap
- nvgpu_mc_reset_dev(g, dev)
> reset given single engine via dev
> if engine is graphics, reset gpcs for nvgpu_next
- nvgpu_mc_reset_devtype(g, devtype)
> reset all engines of given devtype
> if devtype is graphics, reset gpcs for nvgpu_next
Bug 200648985
Bug 3109773
Change-Id: Idc67a14a0a7cde83de44fbfbec13007fead3ed5c
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2408523
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Following changes are made in this patch.
1) Change unnamed structs within gpu_ops to named structs
with the prefix gops_*.
2) Each named struct gops_ are moved into a separate gops specific file
under include/nvgpu/gops/
3) struct gpu_ops is moved into a separate file include/nvgpu/gpu_ops.h
and all other dependent struct gops_* are included in this header.
4) Direct references to include/nvgpu/gops are removed from files as its enough
to include gk20a.h.
Change-Id: Ieb22cb853be567e3bef14f5f8a04674eebd902ea
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2398776
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
When a pbdma fault needs a channel teardown, do the recovery/teardown
process before acking the pbdma interrupt status back. Acking it causes
the hardware to proceed which could release fences too early before the
involved channel(s) have been found to be broken.
With these host copyengine interrupts, the teardown sequence is light
and proceeds even with the pbdma intr flag still set; there are no
engines to reset when these pbdma launch check interrupts happen. The
bad tsg is just disabled and the channels in it aborted.
A few unit tests are so heavily affected by this refactor that they
would need to be rewritten. They're not strictly needed at the moment,
so do only half of the rewrite: just delete them.
Bug 200611198
Change-Id: Id126fb158b6d05e46ba124cd426389046eedc053
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2392669
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Delete the struct nvgpu_engine_info as it's essentially identical to
struct nvgpu_device. Duplicating data structures is not ideal as it's
terribly confusing what does what.
Update all uses of nvgpu_engine_info to use struct nvgpu_device. This
is often a fairly straight forward replacement. Couple of places though
where things got interesting:
- The enum_type that engine_info uses is defined in engines.h and
has a bit of SW abstraction - in particular the GRCE type. The only
place this seemed to be actually relevant (the IOCTL providing device
info to userspace) the GRCE engines can be worked out by comparing
runlist ID.
- Addition of masks based on intr_id and reset_id; those can be
computed easily enough using BIT32() but this is an area that
could be improved on.
This reaches into a lot of extraneous code that traverses the fifo
active engines list and dramtically simplifies this. Now, instead of
having to go through a table of engine IDs that point to the list of
all host engines, the active engine list is just a list of pointers to
valid engines. It's now trivial to do a for-all-active-engines type
loop. This could even be turned into a generic macro or otherwise
abstracted in the future.
JIRA NVGPU-5421
Change-Id: I3a810deb55a7dd8c09836fd2dae85d3e28eb23cf
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2319895
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
During recovery, we set ch->unserviceable at the end after we preempt
the TSG and reset the engines. It might be too late and user-space
might submit more work to the broken channel which is not desirable.
Move setting this unserviceable flag right at the start
of recovery sequence.
Another thread doing a submit can still read the unserviceable flag
just before it is set here, leaving that submit stuck if recovery
completes before the submit thread advances enough to set up a post
fence visible for other threads. This could be fixed with a big lock
or with a double check at the end of the submit code after the job
data has been made visible.
We still release the fences, semaphore and error notifier wait queues
at the end; so user-space would not trigger channel unbind while
channel is being recovered.
Also, change the handle_mmu_fault APIs to return void as the
debug_dump return value is not used in any of the caller APIs.
JIRA NVGPU-5843
Change-Id: Ib42c2816dd1dca542e4f630805411cab75fad90e
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2385256
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently the vGPU engine management rewrites a lot of the common
device agnostic engine management code.
With the new top HAL parsing one device at a time, it is now more
easily possible to tie the vGPU into the new common device framework
by implementing the top HAL but with the vGPU engine list backend.
This lets the vGPU inherit all the common engine and device
management code. By doing so the vGPU HAL need only implement a
trivial and simple HAL.
This also gets us a step closer to merging all of the CE init
code: logically it just iterates through all CE engines whatever
they may be. The only reason this differs between chips is because
of the swap from CE0-2 to LCEs in the Pascal generation. This could
be abstracted by the unit code easily enough.
Also, the pbdma_id for each engine has to be added to the device
struct. Eventually this was going to happen anyway, since the
device struct will soon replace the nvgpu_engine_info struct.
It's a little bit of an abuse but might be worth it long term. If
not, it should not be difficult to replace uses of dev->pbdma_id
with a proper lookup of PBDMA ID based on the device info.
JIRA NVGPU-5421
Change-Id: Ie8dcd3b0150184d58ca0f78940c2e7ca72994e64
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2351877
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Remove an ancient piece of code that morphed from a hard coded fault ID
lookup in the original gk20a driver. In the old days there was no top
parsing code, so converting between engine ID and fault ID was done
by a simle function.
This code derived from that function for some reason. However, given
that the HW top table is not actually broken, this code never executes.
The only way this code would execute is if the HW top table reported
that the fault ID for a GRCE engine is 0. But this never happens.
JIRA NVGPU-5420
Change-Id: If8483faa9878f752c29ef6eadc1a56ce1de81942
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2362865
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>