Decouple the fence information needed for providing submit postfences to
userspace by adding a separate type for that and using it to pass fence
data to ioctls.
The data in struct nvgpu_fence_type is used in various places:
- job tracking needs to know when a post fence is expired
- job submitters within the driver (vidmem clears) need to be able to
wait for these fences
- userspace needs the fence as an id, value pair or as a file descriptor
created from an os fence
To keep object lifetimes strict, start decoupling the os fence data out
of struct nvgpu_fence_type: delete nvgpu_fence_install_fd() and add
nvgpu_fence_extract_user() to return a struct nvgpu_user_fence that
contains only the necessary information. Storing the os fence in job
tracking metadata is legacy code and not useful. Passing the os fence
from where it's created through the whole submit path inside this
combined fence type has been convenient, though.
The internally stored cde job fence in dmabuf compression metadata is
still nvgpu_fence_type to keep this patch simple.
Jira NVGPU-5248
Change-Id: I75b7da676fb6aa083828f888c55571bbf7645ef3
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2359064
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Add a generic profiler based on the channel kickoff profiler. This
aims to provide a mechanism to allow engineers to (more) easily profile
arbitrary software paths within nvgpu.
Usage of this profiler is still primarily through debugfs. Next up is
a generic debugfs interface for this profiler in the Linux code.
The end goal for this is to profile the recovery code and generate
interesting statistics.
JIRA NVGPU-5606
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Change-Id: I99783ec7e5143855845bde4e98760ff43350456d
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2355319
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
This adds a new device management unit in the common code responsible
for facilitating the parsing of the GPU top device list and providing
that info to other units in nvgpu.
The basic idea is to read this list once from HW and store it in a
set of lists corresponding to each device type (graphics, LCE, etc).
Many of the HALs in top can be deleted and instead implemented using
common code parsing the SW representation.
Every time the driver queries the device list it does so using a
device type and instance ID. This is common code. The HAL is responsible
for populating the device list in such a way that the driver can
query it in a chip agnostic manner.
Also delete some of the unit tests for functions that no longer
exist. This code will require new unit tests in time; those should be
quite simple to write once unit testing is needed.
JIRA NVGPU-5421
Change-Id: Ie41cd255404b90ae0376098a2d6e9f9abdd3f5ea
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2319649
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
In Bug 200588835, the spurious FBPA interrupts are seen on couple
of boards. These interrupts were found to be EDC (Error detection
and Correction) interrupts which are triggered due to ECC errors.
The EDC registers are not exposed to the driver, so the interrupt
status register cannot be cleared; resulting in interrupt storm.
Also, it was concluded that only bad HW can cause this failure
scenario. So, in the ISR for FBPA interrupts, get the GPU into
quiesce state as we don't expect the GPU to be in usable state post
such unrecoverable errors.
Adapt the quiesce code for Linux build too.
1. On Linux, we cannot exit the nvgpu process after quiesce like we
do on QNX. So, add nvgpu_disable_irqs() call to quiesce
implementation which is done as part of process exit handler on
QNX. Masking interrupts which is already done as part of quiesce
would be sufficient in most cases, but to be fail-safe
disable_irqs too.
3. Also, the IOCTL code looks at g->sw_ready, hence add
nvgpu_start_gpu_idle() to set g->sw_ready to false along with
setting NVGPU_DRIVER_IS_DYING = true.
We expect the nvgpu_sw_quiesce() call to finish before quiesce thread
wakes up from 50ms sleep. Hence, critical step like
nvgpu_start_gpu_idle() is added to nvgpu_sw_quiesce(), whereas the
somewhat redundant disable IRQs call is added to quiesce thread.
nvgpu_fifo_quiesce() was called twice by mistake; remove one of the
them.
Bug 2919899
Bug 200588835
Change-Id: I9beec688c2e1c0d8dfc1327ddf122684576f8684
Signed-off-by: Tejal Kudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2354537
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The valid flag in struct nvgpu_fence_type is not very useful. It's set
when a fence is created on an allocated object and read in these three
scenarios:
- nvgpu_fence_install_fd() after a submit, if the submit was successful.
A successful submit implies that a post fence exists.
- nvgpu_fence_wait() for a copyengine job when synchronizing the ce
ringbuffer or when waiting for vidmem clears. In these cases the fence
is also clearly always valid.
- nvgpu_fence_is_expired() when testing whether a tracked job has
completed. Such jobs cannot exist without post fences that are
mandatory for tracking, so the fence must exist.
Remove the valid flag. Remove also the other init checks from the above
functions; they're equally unused and confusing implying that such calls
would be acceptable, causing sloppy code at best.
Jira NVGPU-5248
Jira NVGPU-5493
Change-Id: I52c5be1569b343024d2626bd9577f87b46064fba
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2357828
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Implement empty stubs of the channel watchdog functions for when
watchdog is disabled from build. Add some forward declarations that were
missing. Now most call sites don't need #idefs for the build flag.
Add error checks for the wdt alloc failure.
Jira NVGPU-5494
Jira NVGPU-5493
Change-Id: I2d42e8ab4c5e045cd280b2e1f254396127bd154b
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2352050
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently, nvgpu_writel_loop() writes to a register and immediately
checks if register value is updated. It might take some time for
hardware registers to get updated with value written by software.
Modify nvgpu_writel_loop() to accept number of retries to check if
register value is updated and assert with nvgpu_assert().
Also, move nvgpu_writel_loop() to common code and use generic
nvgpu_readl() and nvgpu_writel() APIs.
JIRA NVGPU-5490
Change-Id: Iaaf24203a91eee3d05de7d0c7dea18113367de5f
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2348628
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com>
Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Currently, ltc fs_state is initialized during ltc init support. However,
ltc cbc_param and cbc_param2 registers do not seem to be providing
correct data if ltc.init_fs_state is called before fb.init_fs_state.
- Create fb.init_fb_support hal to initialize fb.
- Trigger init_fb_support before init_ltc_support.
Bug 2969956
Bug 2957808
JIRA NVGPU-4666
Change-Id: I54d697d27b9d9c6318c4ef459d215b6f82cd5571
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2345673
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Make Recovery and quiesce co-exist to support quiesce state
on unrecoverrable errors. Currently, the quiesce code is wrapped
under ifndef CONFIG_NVGPU_RECOVERY. Isolate the quiesce code from
recovery config, thereby enabling it on all builds.
On Linux, the hung_task checker(check_hung_uninterruptible_tasks()
in kernel/hung_task.c) complains that quiesce thread is stuck for
more than 120 seconds.
INFO: task sw-quiesce:1068 blocked for more than 120 seconds.
The wait time of more than 120 seconds is expected as quiesce
thread will wait until quiesce call is triggered on fatal
unrecoverable errors. However, the INFO print upsets the
kernel_warning_test(KWT) on Linux builds. To fix the failing
KWT, change the quiesce task to interruptible instead of
uninterruptible as checker only looks at uninterruptible tasks.
Bug 2919899
JIRA NVGPU-5479
Change-Id: Ibd1023506859d8371998b785e881ace52cb5f030
Signed-off-by: tkudav <tkudav@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2342774
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
GVS: Gerrit_Virtual_Submit
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
The gk20a_debug_dump() function implicitly adds a newline since it
uses nvgpu_err() under the hood (for uart destined prints). For the
seq_file destined writes it does not so there is an annoying inconsistency.
Remove the newline that many of the gk20a_debug_dump() calls add and add
the newline to the (now) seq_printf() call. This reduces the length of
debug dump logs and speeds them up - UART is _very_ slow after all.
Also cleanup some formatting issues in the various debug prints I
happened to notice.
JIRA NVGPU-5541
Change-Id: Iabf853d5c50214794fc4cbb602dfffabeb877132
Signed-off-by: Alex Waterman <alexw@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2347956
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
The ecc init, handling for the fb unit is refactored to improve reusability
for nvgpu-next.
The following changes have been done:
- fb.ecc:
This is a new subunit within fb and contains the following functions:
- init: Moved from fb.fb_ecc_init.
- free: Moved from fb.fb_ecc_free.
- l2tlb_error_mask: Fetch bit mask for corrected, uncorrected errors supported
by the unit.
- fb.intr:
This unit has been updated to include the following ecc interrupt, error
handlers:
- handle_ecc: Top level interrupt handler for fb ecc errors.
- handle_ecc_l2tlb: Handle errors within l2tlb memory.
- handle_ecc_hubtlb: Handle errors within hubtlb memory.
- handle_ecc_fillunit: Handle errors within fillunit memory
Jira: NVGPU-5032
Change-Id: I1a26c1823eb992e0e0175250b969f1186dff6e62
Signed-off-by: Antony Clince Alex <aalex@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2333271
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Split out the max value increment and syncpt interrupt registration out
of nvgpu_channel_sync_incr*(). This API is called in the submit path to
prepare buffers and tracking resources, but later on in the submit path
errors can still occur so that the increment wouldn't happen (unless
artificially forced by sw).
The increment and irq registration cannot easily be undone and it makes
more sense to do these at the moment when the prepared job is finally
ready, so add a new nvgpu_channel_sync_mark_progress() API to be called
later in the submit path to signal that progress shall eventually happen
on the sync. Without this, the max value would stay too large after an
unsuccessful submit until the channel gets closed.
The sync object (syncpt or semaphore) is always exclusively owned by the
channel that allocated it, so nonatomically reading the max value first
in sync_incr() and incrementing it later in mark_progress() is racefree;
all submits per channel are serialized.
Change the channel syncpoint to client managed from host managed so that
nvhost-exported sync fences behave correctly with the temporary state
where the fence threshold is over the max value. Ideally we'd always
track nvgpu-owned syncpts' max values internally, but this is enough for
now.
Jira NVGPU-5491
Change-Id: Idf0bda7ac93d7f2f114cdeb497fe6b5369d21c95
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2340465
Reviewed-by: automaticguardword <automaticguardword@nvidia.com>
Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Move the per-channel hw semaphore object to be owned by the channel sync
(just like with syncpoints, too). Store just the channel ID in the hw
sema for debug prints to get rid of sema->channel dependencies. Make
nvgpu_semaphore_alloc() take a hw sema instead of a channel.
Fix up some channel-related documentation that has been incorrect.
Jira NVGPU-5353
Change-Id: I04d49da3aac50a4cea32e7393f48e6f85a80ca0d
Signed-off-by: Konsta Hölttä <kholtta@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2339931
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Instead of using multiple struct use a single nvgpu_clk_slave_freq
to store the slave freq of gpcclk.
With this patch single struct can be used by both clk_arb and clk_domain.
This will remove nvgpu_set_fll_clk struct as nvgpu_clk_slave_freq serves
the purpose.
NVGPU-4692
Change-Id: Ie45d63e4376b83e153a9aa75e2c4631c6dad857b
Signed-off-by: Abdul Salam <absalam@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2339213
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>