This CL covers the following parity support (uncorrected error),
1) SM's LRF
2) SM's CBU
Volta Resiliency Id - Volta-637
JIRA GPUT19X-85
JIRA GPUT19X-110
Bug 1775457
Change-Id: I3befb1fe22719d06aa819ef27654aaf97f911a9b
Signed-off-by: Lakshmanan M <lm@nvidia.com>
Reviewed-on: http://git-master/r/1481791
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Add support for ISR handling of ECC uncorrectable errors
for Volta resiliency (Volta-686).
TODO: move interrupt init out of MC
bug 1881052
JIRA: GPUT19X-82
Change-Id: I45db01a6062445dd1f64a8297744cd15105e3344
Signed-off-by: David Nieto <dmartineznie@nvidia.com>
Reviewed-on: http://git-master/r/1476603
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Added function pointers to check chip specific valid
gfx class and compute class. Also added function pointer
to update ctx header with preemption buffer pointers.
Also fall back to gp10b functions, where nothing
is changed from gp10b to gv11b.
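The per-chip hooks described above could look roughly like the sketch below. Every name here (struct ex_gr_ops, the ex_* functions, unchanged_op) and every class number is a hypothetical stand-in for illustration, not the driver's actual definition. The assumption that the newer chip also accepts the older chip's classes is likewise illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class numbers, chosen only for this illustration. */
#define EX_GP10B_GFX_CLASS     0x1001u
#define EX_GP10B_COMPUTE_CLASS 0x1002u
#define EX_GV11B_GFX_CLASS     0x2001u
#define EX_GV11B_COMPUTE_CLASS 0x2002u

/* Hypothetical per-chip ops table. */
struct ex_gr_ops {
	bool (*is_valid_gfx_class)(uint32_t class_num);
	bool (*is_valid_compute_class)(uint32_t class_num);
	uint32_t (*unchanged_op)(void); /* stands for any op gv11b does not change */
};

static bool ex_gp10b_is_valid_gfx_class(uint32_t class_num)
{
	return class_num == EX_GP10B_GFX_CLASS;
}

static bool ex_gp10b_is_valid_compute_class(uint32_t class_num)
{
	return class_num == EX_GP10B_COMPUTE_CLASS;
}

static uint32_t ex_gp10b_unchanged_op(void)
{
	return 0;
}

/* Assumption: gv11b accepts its own classes plus the gp10b ones. */
static bool ex_gv11b_is_valid_gfx_class(uint32_t class_num)
{
	return class_num == EX_GV11B_GFX_CLASS ||
	       ex_gp10b_is_valid_gfx_class(class_num);
}

static bool ex_gv11b_is_valid_compute_class(uint32_t class_num)
{
	return class_num == EX_GV11B_COMPUTE_CLASS ||
	       ex_gp10b_is_valid_compute_class(class_num);
}

static void ex_gv11b_init_gr_ops(struct ex_gr_ops *ops)
{
	assert(ops != NULL);
	/* Chip-specific checks. */
	ops->is_valid_gfx_class = ex_gv11b_is_valid_gfx_class;
	ops->is_valid_compute_class = ex_gv11b_is_valid_compute_class;
	/* Nothing changed for this op, so reuse the gp10b function. */
	ops->unchanged_op = ex_gp10b_unchanged_op;
}
```

Callers go through the ops table only, so the class check stays chip-agnostic at the call site.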
Bug 200292090
Change-Id: I69900e32bbcce4576c4c0f7a7119c7dd8e984928
Signed-off-by: seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-on: http://git-master/r/1293503
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Use the new clk HAL to request clock rate instead of direct calls
to Clock Framework. This cuts one direct dependency to Linux APIs.
Also change the HAL to not clear clk ops after they've been
initialized.
JIRA NVGPU-16
Change-Id: I1ab3eac8268f1f3f3305d49782c6a0eb57c6d617
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1463536
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Without these credits, gpu mmu binds over nvlink to soc are hanging.
Also add l2_enabled for mc_elpg_enable.
Bug 1899460
Change-Id: I0b26410d5c8ec9b4c88b319ddd9442f2fd91b321
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1463204
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
gk20a_err() and gk20a_warn() require a struct device pointer,
which is not portable across operating systems. The new nvgpu_err()
and nvgpu_warn() macros take struct gk20a pointer.
Convert the last remaining user of old macros to new ones.
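To illustrate why keying the log macros on the driver's own GPU struct is more portable, here is a minimal user-space sketch. The names ex_gpu, ex_log, ex_err and ex_warn are hypothetical and do not reproduce the real nvgpu_err()/nvgpu_warn() definitions. The point is only that the macro needs nothing from an OS-specific device structure.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical OS-independent GPU struct; only the name is used here. */
struct ex_gpu {
	const char *name;
	char last_msg[128]; /* kept so the example can be checked */
};

/* Format "<gpu name>: <level>: <message>" and print it to stderr. */
static void ex_log(struct ex_gpu *g, const char *level, const char *fmt, ...)
{
	char body[96];
	va_list args;

	assert(g != NULL && g->name != NULL);
	va_start(args, fmt);
	vsnprintf(body, sizeof(body), fmt, args);
	va_end(args);
	snprintf(g->last_msg, sizeof(g->last_msg), "%s: %s: %s",
		 g->name, level, body);
	fprintf(stderr, "%s\n", g->last_msg);
}

/* The first variadic argument is the format string. */
#define ex_err(g, ...)  ex_log((g), "error", __VA_ARGS__)
#define ex_warn(g, ...) ex_log((g), "warning", __VA_ARGS__)
```

A call such as ex_err(g, "bad class 0x%x", class_num) compiles the same way on any OS, because only the driver-owned struct is involved.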
JIRA NVGPU-16
Change-Id: Ib665cfb395fe46ac988ed14d67adef885098e524
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1462968
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Defined a function to get the max number of subcontexts
supported and used it where the max subcontext count is required.
JIRA GV11B-23
Change-Id: I4f6307162486bab1e91cbf66abfee7763c70fe7b
Signed-off-by: seshendra Gadagottu <sgadagottu@nvidia.com>
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1318146
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Add handling for the following two interrupts on top of the legacy ones.
When either is pending, PBDMA is stalled and s/w is expected to execute
teardown.
clear_faulted_error: host was asked to clear fault status when no fault
had been asserted.
eng_reset: an engine was reset while the PBDMA unit was processing a
channel from a runlist that serves the engine.
JIRA GPUT19X-47
Change-Id: I776e5799a73a1b63c394048fa61b597e621cf544
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1306558
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Tested-by: Terje Bergstrom <tbergstrom@nvidia.com>
If the ch/tsg id is unknown and the bit mask for the engines that need to
be recovered is not set, the runlist mask should cover the max number of
supported runlists.
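The rule above can be sketched as a small mask computation. The function name, the EX_INVALID_ID sentinel, and the engine-to-runlist table are assumptions made for this example, not the driver's interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define EX_INVALID_ID 0xffffffffu /* hypothetical "id unknown" sentinel */

/*
 * Build the mask of runlists to act on.
 *  id              ch/tsg id, or EX_INVALID_ID when unknown
 *  id_runlist_id   runlist the id belongs to (ignored when id is unknown)
 *  eng_bitmask     engines that need recovery, one bit per engine
 *  eng_runlist_id  hypothetical table: runlist serving each engine
 */
static uint32_t ex_get_runlists_mask(uint32_t id, uint32_t id_runlist_id,
				     uint32_t eng_bitmask,
				     const uint32_t *eng_runlist_id,
				     uint32_t num_engines,
				     uint32_t max_runlists)
{
	uint32_t mask = 0;
	uint32_t i;

	assert(max_runlists >= 1 && max_runlists <= 32 && num_engines <= 32);

	if (id != EX_INVALID_ID)
		mask |= 1u << id_runlist_id;

	for (i = 0; i < num_engines; i++) {
		if (eng_bitmask & (1u << i))
			mask |= 1u << eng_runlist_id[i];
	}

	/* Nothing identifies the faulted context: cover every runlist. */
	if (id == EX_INVALID_ID && eng_bitmask == 0)
		mask = (max_runlists == 32) ? 0xffffffffu
					    : (1u << max_runlists) - 1u;

	return mask;
}
```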
JIRA GPUT19X-7
Change-Id: I08e67af0846784a7918510d68de34e9162a42bf6
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1458155
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
- detect and decode sched_error type. Any sched error starting with xxx_* is
not supported in h/w and should never be seen by s/w
- for bad_tsg sched error, preempt all runlists to recover as faulted ch/tsg
is unknown. For other errors, just report error.
- ctxsw timeout is not part of sched error fifo interrupt. A new
fifo interrupt, ctxsw timeout is added in gv11b. Add s/w handling.
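A minimal sketch of the decode-and-dispatch policy described above. The code-to-name table is hypothetical: only the xxx_ prefix and the bad_tsg name come from the text, while the numeric codes and the other entry are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum ex_sched_action {
	EX_REPORT_ONLY,          /* just report the error */
	EX_PREEMPT_ALL_RUNLISTS, /* faulted ch/tsg unknown: preempt everything */
	EX_UNEXPECTED            /* not supported in h/w, should never be seen */
};

/* Hypothetical code -> name table; real codes are not given in the text. */
static const char *const ex_sched_error_names[] = {
	"xxx_0",
	"xxx_1",
	"bad_tsg",
	"example_error",
};

#define EX_NUM_SCHED_ERRORS \
	(sizeof(ex_sched_error_names) / sizeof(ex_sched_error_names[0]))

static enum ex_sched_action ex_sched_error_action(uint32_t code)
{
	const char *name;

	if (code >= EX_NUM_SCHED_ERRORS)
		return EX_UNEXPECTED;

	name = ex_sched_error_names[code];
	assert(name != NULL);

	if (strncmp(name, "xxx_", 4) == 0)
		return EX_UNEXPECTED;
	if (strcmp(name, "bad_tsg") == 0)
		return EX_PREEMPT_ALL_RUNLISTS;
	return EX_REPORT_ONLY;
}
```

The ctxsw timeout is deliberately absent from this table, since it arrives as its own fifo interrupt rather than as a sched error.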
Bug 1856152
JIRA GPUT19X-74
Change-Id: I474e1a3cda29a450691fe2ea1dc1e239ce57df1a
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1317615
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
gk20a_err() and gk20a_warn() require a struct device pointer,
which is not portable across operating systems. The new nvgpu_err()
and nvgpu_warn() macros take struct gk20a pointer. Convert code
to use the more portable macros.
JIRA NVGPU-16
Change-Id: I8c0d8944f625e3c5b16a9f5a2a59d95a680f4e55
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1459822
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Context TSG teardown procedure:
1. Disable scheduling for the engine's runlist via NV_PFIFO_SCHED_DISABLE.
This enables SW to determine whether a context has hung later in the
process: otherwise, ongoing work on the runlist may keep ENG_STATUS from
reaching a steady state.
2. Disable all channels in the TSG being torn down or submit a new runlist
that does not contain the TSG. This is to prevent the TSG from being
rescheduled once scheduling is reenabled in step 6.
3. Initiate a preempt of the engine by writing the bit associated with its
runlist to NV_PFIFO_RUNLIST_PREEMPT. This allows the preempt process to
begin before the slow register reads needed to determine
whether the context has hit any interrupts or is hung. Do not poll
NV_PFIFO_RUNLIST_PREEMPT for the preempt to complete.
4. Check for interrupts or hangs while waiting for the preempt to complete.
During the pbdma/eng preempt finish polling, any stalling interrupts
relating to runlist must be detected and handled in order for the
preemption to complete.
5. If a reset is needed as determined by step 4:
   a. Halt the memory interface for the engine (as per the relevant engine
      procedure).
   b. Reset the engine via NV_PMC_ENABLE.
   c. Take the engine out of reset and reinit the engine (as per the
      relevant engine procedure).
6. Re-enable scheduling for the engine's runlist via NV_PFIFO_SCHED_ENABLE.
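The ordering of the six steps can be sketched as follows. This is not driver code: struct ex_hw, its two callbacks, and the recording helpers are hypothetical, and register access is replaced by a step log so the sequence can be checked in user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ex_step {
	EX_SCHED_DISABLE,        /* step 1 */
	EX_DISABLE_TSG_CHANNELS, /* step 2 */
	EX_RUNLIST_PREEMPT,      /* step 3 */
	EX_HALT_MEM_IF,          /* step 5a */
	EX_ENGINE_RESET,         /* step 5b */
	EX_ENGINE_REINIT,        /* step 5c */
	EX_SCHED_ENABLE          /* step 6 */
};

/* Hypothetical hardware abstraction with a log of executed steps. */
struct ex_hw {
	void (*do_step)(struct ex_hw *hw, enum ex_step step);
	bool (*poll_preempt)(struct ex_hw *hw); /* step 4: true if reset needed */
	enum ex_step log[8];
	size_t n_log;
	bool hung; /* used by the fake poll below */
};

/* Run the teardown sequence; returns true if an engine reset was done. */
static bool ex_teardown_tsg(struct ex_hw *hw)
{
	bool needs_reset;

	assert(hw != NULL && hw->do_step != NULL && hw->poll_preempt != NULL);

	hw->do_step(hw, EX_SCHED_DISABLE);
	hw->do_step(hw, EX_DISABLE_TSG_CHANNELS);
	hw->do_step(hw, EX_RUNLIST_PREEMPT);

	/* Handle interrupts and detect hangs while the preempt completes. */
	needs_reset = hw->poll_preempt(hw);
	if (needs_reset) {
		hw->do_step(hw, EX_HALT_MEM_IF);
		hw->do_step(hw, EX_ENGINE_RESET);
		hw->do_step(hw, EX_ENGINE_REINIT);
	}

	hw->do_step(hw, EX_SCHED_ENABLE);
	return needs_reset;
}

/* Test doubles: record each step, report a hang when hw->hung is set. */
static void ex_record_step(struct ex_hw *hw, enum ex_step step)
{
	if (hw->n_log < sizeof(hw->log) / sizeof(hw->log[0]))
		hw->log[hw->n_log++] = step;
}

static bool ex_fake_poll(struct ex_hw *hw)
{
	return hw->hung;
}
```

Scheduling is re-enabled on both paths, so a healthy preempt skips only the reset sub-steps.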
JIRA GPUT19X-7
Change-Id: I1354dd12b4a4f0e4b4a8d9721581126c02288a85
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1327931
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Move code that touches host registers to fifo HAL. This sorts out
some of the dependencies between fifo HAL and channel HAL.
Change-Id: I2bff0443ae1c1fa5608e620974b440696d1cfdc4
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1323385
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Move the name field from struct gpu_ops up to struct gk20a. The field
is not a function op, so it doesn't belong in gpu_ops.
Replace all uses of dev_name() with use of g->name when possible.
JIRA NVGPU-16
Change-Id: I053aeb256f591af2ea9ef5094a20e33a395cdd33
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1328535
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
- Implement gv11b-specific reset_enable_hw fifo ops.
- Timeout period in fifo_fb_timeout_r() is set to init instead of max.
  This register specifies the number of microseconds Host should wait
  for a response from FB before initiating a timeout interrupt. For
  bringup, this value should be set lower than usual, such as
  ~0.5 milliseconds (500), to help find bugs in the memory subsystem.
- Timeout period in pbdma_timeout_r() is set to init instead of max.
  This register contains a value used for detecting timeouts. The
  timeout value is in microsecond ticks. The timeouts that use this
  value are:
    GPfifo fetch timeouts to FB for acks, reqs, rdats.
    PBDMA connection to LB.
    GPfifo processor timeouts to FB for acks, reqs, rdats.
    Method processor timeouts to FB for acks, reqs, rdats.
  The init value is changed to 64K us based on bug 1816557.
JIRA GPUT19X-74
JIRA GPUT19X-47
Change-Id: I6f818e129c3ea67571d206c5e735607cbfcf6ec6
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1325352
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
CTX_STATUS_SWITCH: Engine save hasn't started yet, continue to poll
CTX_STATUS_INVALID: The engine context has switched off. The
preemption step for this engine is complete.
CTX_STATUS_VALID or CTX_STATUS_CTXSW_SAVE: check the ID field:
* If ID matches the TSG for the context being torn down, the engine
reset procedure can be performed, or SW can continue
waiting for preempt to finish if id is not being torn down.
* If ID does NOT match, the context isn't running on the engine.
CTX_STATUS_LOAD: check the NEXT_ID field:
* If NEXT_ID matches the TSG of the context being torn down, the engine
is loading the context and reset can be performed
immediately or after a delay to allow the context a chance to load and
be saved off, or SW can continue waiting for preempt to finish if ID
is not being torn down.
* If NEXT_ID does not match the TSG ID or CHID then the context is no
longer on the engine.
SW may alternatively wait for the CTX_STATUS to reach INVALID, but this
may take longer if an unrelated context is currently on the engine or
being switched to.
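The status checks above amount to a small decision function. The sketch below uses hypothetical enum and function names. Where the text allows either resetting or continuing to wait, it returns EX_RESET_ELIGIBLE and leaves that choice to the caller.

```c
#include <stdint.h>

enum ex_ctx_status {
	EX_CTX_STATUS_INVALID,
	EX_CTX_STATUS_VALID,
	EX_CTX_STATUS_LOAD,
	EX_CTX_STATUS_CTXSW_SAVE,
	EX_CTX_STATUS_SWITCH
};

enum ex_preempt_verdict {
	EX_KEEP_POLLING,  /* engine save has not started yet */
	EX_PREEMPT_DONE,  /* target context is not on the engine */
	EX_RESET_ELIGIBLE /* target is on the engine: reset or keep waiting */
};

/* Classify one engine status sample against the TSG being torn down. */
static enum ex_preempt_verdict ex_check_engine(enum ex_ctx_status status,
					       uint32_t id, uint32_t next_id,
					       uint32_t target_tsgid)
{
	switch (status) {
	case EX_CTX_STATUS_SWITCH:
		return EX_KEEP_POLLING;
	case EX_CTX_STATUS_INVALID:
		return EX_PREEMPT_DONE;
	case EX_CTX_STATUS_VALID:
	case EX_CTX_STATUS_CTXSW_SAVE:
		/* ID names the context currently on the engine. */
		return (id == target_tsgid) ? EX_RESET_ELIGIBLE
					    : EX_PREEMPT_DONE;
	case EX_CTX_STATUS_LOAD:
		/* NEXT_ID names the context being loaded. */
		return (next_id == target_tsgid) ? EX_RESET_ELIGIBLE
						 : EX_PREEMPT_DONE;
	}
	return EX_KEEP_POLLING; /* unknown status: be conservative */
}
```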
JIRA GPUT19X-7
Change-Id: I61499f932019de32e0200084c4c41b21a7cbbd2b
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1327164
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Init device_fatal, channel_fatal and restartable fifo intr pbdma s/w
variables for pbdma_intr_0 interrupt masks.
pbdma_intr_0 field changes for gv11b:
bit 8 (lbreq) does not exist in hw.
bit 28 (syncpoint_illegal) is removed in hw.
bit 20 is reused for clear_faulted_error in hw.
bit 24 (eng_reset) and bit 25 (semaphore) always existed in hw
but were never handled in s/w. These are added as channel fatal.
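As a rough illustration of the mask changes listed above, the sketch below derives a channel-fatal mask from a legacy one. The bit positions come from the text; the macro and function names are hypothetical. Bit 20 is defined but left unclassified here, because the text does not say which mask it belongs to.

```c
#include <stdint.h>

#define EX_BIT(n) (1u << (n))

/* Bit positions taken from the description; names are illustrative. */
#define EX_PBDMA_INTR_0_LBREQ               EX_BIT(8)  /* absent on gv11b */
#define EX_PBDMA_INTR_0_CLEAR_FAULTED_ERROR EX_BIT(20) /* reused bit */
#define EX_PBDMA_INTR_0_ENG_RESET           EX_BIT(24)
#define EX_PBDMA_INTR_0_SEMAPHORE           EX_BIT(25)
#define EX_PBDMA_INTR_0_SYNCPOINT_ILLEGAL   EX_BIT(28) /* removed on gv11b */

/* Derive the gv11b channel-fatal mask from a legacy mask. */
static uint32_t ex_gv11b_channel_fatal_mask(uint32_t legacy_mask)
{
	uint32_t mask = legacy_mask;

	/* Drop fields that no longer exist in hw. */
	mask &= ~(EX_PBDMA_INTR_0_LBREQ | EX_PBDMA_INTR_0_SYNCPOINT_ILLEGAL);

	/* Add the interrupts that were previously never handled in s/w. */
	mask |= EX_PBDMA_INTR_0_ENG_RESET | EX_PBDMA_INTR_0_SEMAPHORE;

	return mask;
}
```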
JIRA GPUT19X-47
Change-Id: I13673430408f1cf7ef762075a29b94196f79a349
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1325401
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
The calls to nvhost_{register,unregister}_client_domain don't do
anything, so remove the platform device's late_probe and remove ops that
serve no other purpose than calling those empty functions. Also remove
the corresponding #includes, which are now unused.
Bug 1853519
Change-Id: I67149d1575be5b3cacc60e6c28e6f2debfabf71c
Signed-off-by: Konsta Holtta <kholtta@nvidia.com>
Reviewed-on: http://git-master/r/1326126
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Preempt completion should be decided based on pbdma and
engine status. The preempt_pending field is no longer used
to detect whether the preempt has finished.
Add a new function to be used for preempting ch and tsg
during recovery. If the preempt times out while in recovery, do not
issue recovery.
JIRA GPUT19X-7
Change-Id: I0d69d12ee6a118f6628b33be5ba387c72983b32a
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1309850
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>