Update SW quiesce as follows:
- After waking up sw_quiesce_thread, nvgpu_sw_quiesce
masks interrupts, then disables and preempts runlists
without lock. There could be still a concurrent thread
that would re-enable the runlist by accident. This is
very unlikely and would mean we are not in mission mode
anyway.
- In sw_quiesce_thread, wait NVGPU_SW_QUIESCE_TIMEOUT_MS,
to leave some time for interrupt handler to set error
notifier (in case of HW error interrupt). Then disable
and preempt runlists, and set error notifier for remaining
channels before exiting the process.
Also modified nvgpu_can_busy to return false in case
SW quiesce is pending. This will make subsequent
devctl to fail.
Jira NVGPU-4512
Change-Id: I36dd554485f3b9b08f740f352f737ac4baa28746
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2266389
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch adds boundary value check for common.fifo parameters as
listed below.
1. nvgpu_channel_setup_bind() includes a condition to check that value
of num_gpfifo_entries does not exceed 2^31. Otherwise prints message and
returns error.
2. nvgpu_tsg_bind_channel() includes a condition to check if channel
subctx had ASYNC id. If true, runqueue selector is set to 1 and 0
otherwise. This check is to be moved from devctl to common.fifo.
Jira NVGPU-4817
Change-Id: Id1c9253945859c245e584b5c42b3285a6b620055
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2278613
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
nvgpu_engine_is_valid_runlist_id already iterates the list of
active engines, therefore the engine_id is already known to
be valid.
Remove call to nvgpu_engine_get_active_eng_info (which iterates
all engines), and fetch f->engine_info[engine_id] instead.
Also remove non-NULL test for engine_info, which could not
be true.
Also make sure to reset num_engines in nvgpu_cleanup_sw, to avoid
accessing uninitialized active_engines_list in unit test corner
cases (targetting init/remove support).
Jira NVGPU-4511
Change-Id: Ia6b904a7f3ca46e5097f06770b4caad317ec967b
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2263618
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
To achieve permanent fault coverage, the CTAs launched by
each kernel in the mission and redundant contexts must execute on
different hardware resources. This feature proposes modifications
in the software to modify the virtual SM id to TPC mapping across
the mission and redundant contexts. The virtual SM identifier to TPC
mapping is done by nvgpu when setting up the patch context.
The recommendation for the redundant setting is to offset the
assignment by one TPC, and not by one GPC. This will ensure that both
GPC and TPC diversity. The SM and Quadrant diversity will happen
naturally. For kernels with few CTAs, the diversity is guaranteed
to be 100%. In case of completely random CTA allocation,
e.g. large number of CTAs in the waiting queue, the diversity is
1 - 1/#SM, or 87.5% for GV11B, 97.9% for TU104.
Added NvGpu CFLAGS to enable/disable the SM diversity support
"CONFIG_NVGPU_SM_DIVERSITY".
This support is only enabled on gv11b and tu104 QNX non safety build.
JIRA NVGPU-4685
Change-Id: I8e3eaa72d8cf7aff97f61e4c2abd10b2afe0fe8b
Signed-off-by: Lakshmanan M <lm@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2268026
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-by: Shashank Singh <shashsingh@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
nvgpu_engine_get_gr_runlist_id gets the first instance of
active GR engine using nvgpu_engine_get_ids. Therefore the
engine_id is already known to be valid.
Remove call to nvgpu_engine_get_active_eng_info (which iterates
all engines), and fetch f->engine_info[engine_id] instead.
Also remove non-NULL test for engine_info, which could not
be true.
Jira NVGPU-4511
Change-Id: Ifcc0851e3d14d862e2ed7b21ea57f17a66eca9dd
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2263617
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
event_id_list and event_id_list_locks fields are only
needed in nvgpu_tsg when CONFIG_NVGPU_CHANNEL_TSG_CONTROL
is defined.
Conditionally compile those fields and related code,
so that they are removed from safety build.
Jira NVGPU-4376
Change-Id: I8678aa1b8cd4166aa37bcb42cda1eb9c703fd32f
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2273261
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently, GPU fifo sw_ready flag is not reset after fifo clean_up
execution. This patch resets g->fifo.sw_ready flag in
nvgpu_fifo_cleanup_sw_common() to indicate fifo attributes are reset.
Also, pbdma setup and cleanup functions are optional and may not be
populated. This patch modifies nvgpu_fifo_cleanup_sw_common() to
executes nvgpu_pbdma_cleanup_sw() if pbdma.cleanup_sw is populated.
Jira NVGPU-4339
Change-Id: I6fd53577afdd0a15c75f15b54a916e70e850d1b0
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2237809
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Fix bug where the CE mask includes other engine types besides just CEs
in nvgpu_ce_engine_interrupt_mask().
The intent of this API is to return mask of CE interrupts. However, the
if clause in the for loop is only excluding engine interrupts if the CE
stall or non-stall ISR is NULL. So, it does not distinquish between CE
or GR engine interrupts if the CE ISR is non-null.
Since the expectation is to not return CE interrupts if the ISRs are
NULL, just return a 0 mask if either ISR is NULL without having to
bother with the loop.
If the ISRs are set in the CE HAL, within the loop, only add interrupts
to the mask returned if the engine type is actually a CE engine (i.e. do
not include GR engine interrupts).
JIRA NVGPU-2224
Change-Id: Ic0048b00f16590fec50bb0858bd3f4498a00650d
Signed-off-by: Philip Elcan <pelcan@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2256269
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Previously, unit interrupt enabling/disabling and corresponding MC level
interrupt enabling/disabling was not done at the same time.
With this change, stall and nonstall interrupt for units are programmed
at MC level along with individual unit interrupts. Kept access to MC
interrupt registers through mc.intr_lock spinlock.
For doing this separated CE and GR interrupt mask functions.
mc.intr_enable is only used when there is global interrupt
control to be set. Removed mc_gp10b.c as mc_gp10b_intr_enable
is now removed. Removed following functions - mc_gv100_intr_enable,
mc_gv11b_intr_enable & intr_tu104_enable. Removed intr_pmu_unit_config
as we can use the generic unit interrupt control function.
JIRA NVGPU-4336
Change-Id: Ibd296d4a60fda6ba930f18f518ee56ab3f9dacad
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2196178
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
gk20a.h will include gops_mc.h to contain the mc ops definitions. Add
doxygen comments for the HAL functions that are called directly.
Also move mc_gp10b_intr_pmu_unit_config to non-fusa HAL file.
JIRA NVGPU-2524
Change-Id: I4f326332d7842211b004b372d79fac9fe6ed40e7
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2226017
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
IRQs can get triggered during nvgpu power-on due to MMU fault, invalid
PRIV ring or bus access etc. Handlers for those IRQs can't access the
full state related to the IRQ unless nvgpu is fully powered on.
In order to let the IRQ handlers know about the nvgpu power-on state
gk20a.power_on_state variable has to be protected through spinlock
to avoid the deadlock due to usage of earlier power_lock mutex.
Further the IRQs need to be disabled on local CPU while updating the
power state variable hence use spin_lock_irqsave and spin_unlock_-
irqrestore APIs for protecting the access.
JIRA NVGPU-1592
Change-Id: If5d1b5e2617ad90a68faa56ff47f62bb3f0b232b
Signed-off-by: Sagar Kamble <skamble@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2203860
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This change eliminates MISRA Advisory Rule 18.4 violations in the
following cases:
* nvgpu_submit_append_gpfifo_user_direct()
* nvgpu_submit_append_gpfifo_common()
- use array-indexing to access gpfifo entry lists
* gv11b_gr_intr_record_sm_error_state()
- use array-indexing to access sm_error_states table
Advisory Rule 18.4 states that the +, -, +=, and -= operators should
not be applied to an expression of pointer type.
JIRA NVGPU-3798
Change-Id: I736930e4ba09a88888b0ef48f62496c4082ea5a1
Signed-off-by: Scott Long <scottl@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2210173
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This change eliminates MISRA Advisory Rule 18.4
violations in the following by accessing g->fifo.channel
with array indexing:
* nvgpu_channel_init_support()
* nvgpu_channel_semaphore_wakeup()
Advisory Rule 18.4 states that the +, -, +=, and -=
operators should not be applied to an expression
of pointer type.
JIRA NVGPU-3798
Change-Id: I6b1bf360db6ec25894cc0ea430c33067e0cddf64
Signed-off-by: Scott Long <scottl@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2207550
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
nvgpu_kzalloc can fail in gv11b_init_eng_method_buffers.
Added checks on returned pointer.
Also changed g->ops.tsg.init_eng_method_buffers to return an int,
and check return value in callers.
Jira NVGPU-3788
Change-Id: Icb541665c40b89d512929cc9cf9f6a3e7a0033db
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2205851
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch addresses misra violations due to SDL error reporting
callbacks. In particular, it addresses the following misra violation:
- misra_c_2012_directive_4_7_violation: Calling function
"nvgpu_report_*_err()" which returns error information without testing
the error information.
JIRA NVGPU-4025
Change-Id: Ia10b6b3fd9c127a8c5189c3b6ba316f243cedf04
Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2196895
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add macros for whitelisting coverity violations. These macros use pragma
directives. The pragma directives and whitelisting macros are only
enabled when a coverity scan is being run.
The whitelisting macros have been added to a new header called
static_analysis.h. The contents of safe_ops.h (CERT C safe ops) have
been moved into static_analysis.h because this will be the new header
for static analysis related macros/defines/etc.
JIRA NVGPU-3820
Change-Id: I9c63f20f670880b420415535738034619314b7c3
Signed-off-by: Adeel Raza <araza@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2180600
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The following misra violations are fixed in the current patch.
1) misra_c_2012_directive_4_7_violation: Calling function
"nvgpu_report_host_err" which returns error information without testing
the error information.
2) misra_c_2012_directive_4_7_violation: The variable "intr_0_en_mask"
which contains error information hasn't been tested.
3) misra_c_2012_directive_4_7_violation: Calling function
"gv11b_fifo_intr_0_error_mask(g)" which returns error information
without testing the error information.
4) misra_c_2012_rule_8_6_violation: "gk20a_fifo_bar1_snooping_disable"
is declared but never defined.
5) misra_c_2012_rule_8_6_violation: "gm20b_fuse_check_priv_security" is
declared but never defined.
6) misra_c_2012_rule_8_6_violation: "gm20b_fuse_status_opt_gpc" is
declared but never defined.
Jira NVGPU-3881
Change-Id: I731cd1d99649e07cb39aa75c4715e17eedd4d927
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2188161
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
CTS test dEQP-VK.api.object_management.max_concurrent.device_group
crashes with invalid userspace memory access.
Currently, nvgpu_submit_prepare_syncs() races with
nvgpu_channel_clean_up_jobs() and this race condition is exposed when
aggressive_sync_destroy_thresh is set to non-zero value.
nvgpu_submit_prepare_syncs() gets ref for c->sync to submit job and
releases channel sync_lock immediately. Meanwhile,
nvgpu_worker_poll_work() triggers nvgpu_channel_clean_up_jobs(), which
destroys ref'd c->sync pointer.
Channel sync is deleted by nvgpu_channel_clean_up_jobs() only if
aggressive_sync_destroy_thresh is non-zero.
So, nvgpu_channel_clean_up_jobs() and nvgpu_submit_prepare_syncs() will
race only in this scenario.
Hence, if aggressive_sync_destroy_thresh value is non-zero, this patch
protects channel's sync pointer by holding channel sync_lock
during complete execution of nvgpu_submit_prepare_syncs().
Bug 2613870
Change-Id: I030d8df7af10d4ed86f921b5cf60de2b1d60e5d3
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2181360
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Replaced ch->mmu_debug_mode_enabled with ch->mmu_debug_mode_refcnt.
If channel is enabled multiple times by userspace, then ref count is
updated accordingly. There is an expectation that enable/disable
calls are balanced for setting channel's mmu debug mode.
When unbinding the channel, decrease refcnt for the channel until it
reaches 0.
Also, removed tsg parameter from nvgpu_tsg_set_mmu_debug_mode as it
can be retrieved from ch.
Bug 2515097
Change-Id: If334e374a55bd14ae219edbfd3b1fce5ff25c226
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2184702
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch adds nvgpu API in linux and posix to query vpr resize.
The new API nvgpu_is_vpr_resize_enabled() is used in
nvgpu_submit_channel_gpfifo().
Previously, if non-deterministic channel has timeout disabled and
GPU cannot railgate on some platform, then channel doesn't power ref
count and results in video freeze. To resolve non-determinstic channel
job tracking needs to be enabled if vpr resize is supported or if GPU
can railgate.
Bug 200532122
Change-Id: Icfbff6253762b195b2f5955749343974b1a7a269
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2171093
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
For safety build, nvgpu driver should enter SW quiesce state
in case an uncorrectable error has occurred. In this state, any
activity on the GPU should be prevented, without powering off the GPU.
Also, a minimal set of operations should be used to enter SW quiesce
state.
Entering SW quiesce state does the following:
- set sw_quiesce_pending: when this flag is set, interrupt
handlers exit after masking interrupts. This should help mitigate
an interrupt storm.
- wake up thread to complete quiescing.
The thread performs the following:
- set NVGPU_DRIVER_IS_DYING to prevent allocation of new resources
- disable interrupts
- disable fifo scheduling
- preempt all runlists
- set error notifier for all active channels
Note: for channels with usermode submit enabled, userspace can
still ring doorbell, but this will not trigger any work on
engines since fifo scheduling is disabled.
Jira NVGPU-3493
Change-Id: I639a32da754d8833f54dcec1fa23135721d8d89a
Signed-off-by: Thomas Fleury <tfleury@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2172391
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>