nvgpu_log/info/warn/err() internally add a \n to the end of the message.
Hence, callers should not include a \n at the end of the message. Doing
so results in duplicate \n being printed, which ends up creating empty
log messages. Remove the duplicate \n from all messages.
Bug 1928311
Change-Id: I21c141934a125e0cc0cead9fb19fa6502235cf06
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Reviewed-on: http://git-master/r/1487233
Reviewed-by: Automatic_Commit_Validation_User
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
(1) Re-arrange the structure for ecc counters reporting so multiple
units can be managed
(2) Add counters and handling for additional GPC counters
JIRA: GPUT19X-84
Change-Id: I74fd474d7daf7590fc7f7ddc9837bb692512d208
Signed-off-by: David Nieto <dmartineznie@nvidia.com>
Reviewed-on: http://git-master/r/1485277
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Adding support for ISR handling of GPCCS exceptions
and GCC ECC support
JIRA: GPUT19X-83
Change-Id: Ica749dc678f152d536052cf47f2ea2b205a231d6
Signed-off-by: David Nieto <dmartineznie@nvidia.com>
Reviewed-on: http://git-master/r/1480997
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
This CL covers the following parity support (uncorrected error),
1) SM's L1 DATA
2) SM's L0 && L1 icache
Volta Resiliency Id - Volta-634
JIRA GPUT19X-113
JIRA GPUT19X-99
Bug 1807553
Change-Id: Iacbf492028983529dadc5753007e43510b8cb786
Signed-off-by: Lakshmanan M <lm@nvidia.com>
Reviewed-on: http://git-master/r/1483681
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This CL covers the following parity support (uncorrected error),
1) SM's LRF
2) SM's CBU
Volta Resiliency Id - Volta-637
JIRA GPUT19X-85
JIRA GPUT19X-110
Bug 1775457
Change-Id: I3befb1fe22719d06aa819ef27654aaf97f911a9b
Signed-off-by: Lakshmanan M <lm@nvidia.com>
Reviewed-on: http://git-master/r/1481791
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Adding support for ISR handling of ecc uncorrectable errors
for volta resiliency (Volta-686)
TODO: move interrupt init out of MC
bug 1881052
JIRA: GPUT19X-82
Change-Id: I45db01a6062445dd1f64a8297744cd15105e3344
Signed-off-by: David Nieto <dmartineznie@nvidia.com>
Reviewed-on: http://git-master/r/1476603
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Added function pointers to check chip specific valid
gfx class and compute class. Also added function pointer
to update ctx header with preemption buffer pointers.
Also fall back to gp10b functions, where nothing
is changed from gp10b to gv11b.
Bug 200292090
Change-Id: I69900e32bbcce4576c4c0f7a7119c7dd8e984928
Signed-off-by: seshendra Gadagottu <sgadagottu@nvidia.com>
Reviewed-on: http://git-master/r/1293503
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Use the new clk HAL to request clock rate instead of direct calls
to Clock Framework. This cuts one direct dependency to Linux APIs.
Also change the HAL to not clear clk ops after they've been
initialized.
JIRA NVGPU-16
Change-Id: I1ab3eac8268f1f3f3305d49782c6a0eb57c6d617
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1463536
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Without these credits, gpu mmu binds over nvlink to soc are hanging.
Also add l2_enabled for mc_elpg_enable.
Bug 1899460
Change-Id: I0b26410d5c8ec9b4c88b319ddd9442f2fd91b321
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1463204
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
gk20a_err() and gk20a_warn() require a struct device pointer,
which is not portable across operating systems. The new nvgpu_err()
and nvgpu_warn() macros take struct gk20a pointer.
Convert the last remaining user of old macros to new ones.
JIRA NVGPU-16
Change-Id: Ib665cfb395fe46ac988ed14d67adef885098e524
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1462968
Reviewed-by: Alex Waterman <alexw@nvidia.com>
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Defined function to get max number of subcontexs
supported and used it where max subcontext count required.
JIRA GV11B-23
Change-Id: I4f6307162486bab1e91cbf66abfee7763c70fe7b
Signed-off-by: seshendra Gadagottu <sgadagottu@nvidia.com>
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1318146
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Add handling for below two interrupts on top of legacy ones. When pending,
PBDMA is stalled and s/w is expected to execute teardown.
clear_faulted_error: host is asked to clear fault status when no fault has been asserted.
eng_reset: An engine was reset while the PBDMA unit was processing a
channel from a runlist which serves the engine.
JIRA GPUT19X-47
Change-Id: I776e5799a73a1b63c394048fa61b597e621cf544
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1306558
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Tested-by: Terje Bergstrom <tbergstrom@nvidia.com>
if ch/tsg id is unknown and bit mask for the engines that need to be
recovered is not set, runlist mask should correspond to max number of
supported runlists
JIRA GPUT19X-7
Change-Id: I08e67af0846784a7918510d68de34e9162a42bf6
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1458155
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
- detect and decode sched_error type. Any sched error starting with xxx_* is
not supported in h/w and should never be seen by s/w
- for bad_tsg sched error, preempt all runlists to recover as faulted ch/tsg
is unknown. For other errors, just report error.
- ctxsw timeout is not part of sched error fifo interrupt. A new
fifo interrupt, ctxsw timeout is added in gv11b. Add s/w handling.
Bug 1856152
JIRA GPUT19X-74
Change-Id: I474e1a3cda29a450691fe2ea1dc1e239ce57df1a
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1317615
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
gk20a_err() and gk20a_warn() require a struct device pointer,
which is not portable across operating systems. The new nvgpu_err()
and nvgpu_warn() macros take struct gk20a pointer. Convert code
to use the more portable macros.
JIRA NVGPU-16
Change-Id: I8c0d8944f625e3c5b16a9f5a2a59d95a680f4e55
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1459822
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
Reviewed-by: Alex Waterman <alexw@nvidia.com>
GVS: Gerrit_Virtual_Submit
Context TSG teardown procedure:
1. Disable scheduling for the engine's runlist via NV_PFIFO_SCHED_DISABLE.
This enables SW to determine whether a context has hung later in the
process: otherwise, ongoing work on the runlist may keep ENG_STATUS from
reaching a steady state.
2. Disable all channels in the TSG being torn down or submit a new runlist
that does not contain the TSG. This is to prevent the TSG from being
rescheduled once scheduling is reenabled in step 6.
3. Initiate a preempt of the engine by writing the bit associated with its
runlist to NV_PFIFO_RUNLIST_PREEMPT. This allows to begin the preempt
process prior to doing the slow register reads needed to determine
whether the context has hit any interrupts or is hung. Do not poll
NV_PFIFO_RUNLIST_PREEMPT for the preempt to complete.
4. Check for interrupts or hangs while waiting for the preempt to complete.
During the pbdma/eng preempt finish polling, any stalling interrupts
relating to runlist must be detected and handled in order for the
preemption to complete.
5. If a reset is needed as determined by step 4:
a. Halt the memory interface for the engine (as per the relevant engine
procedure).
b. Reset the engine via NV_PMC_ENABLE.
c. Take the engine out of reset and reinit the engine (as per
relevant engine procedure)
6. Re-enable scheduling for the engine's runlist via NV_PFIFO_SCHED_ENABLE.
JIRA GPUT19X-7
Change-Id: I1354dd12b4a4f0e4b4a8d9721581126c02288a85
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1327931
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Move code that touches host registers to fifo HAL. This sorts out
some of the dependencies between fifo HAL and channel HAL.
Change-Id: I2bff0443ae1c1fa5608e620974b440696d1cfdc4
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1323385
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Move the name field from struct gpu_ops up to struct gk20a. The field
is not a function op, so it doesn't belong in gpu_ops.
Replace all uses of dev_name() with use of g->name when possible.
JIRA NVGPU-16
Change-Id: I053aeb256f591af2ea9ef5094a20e33a395cdd33
Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com>
Reviewed-on: http://git-master/r/1328535
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
-implement gv11b specific reset_enable_hw fifo ops
-timeout period in fifo_fb_timeout_r() is set to init instead of max
This register specifies the number of microseconds Host
should wait for a response from FB before initiating a timeout interrupt.
For bringup, this value should be set to a lower value than usual, such as
~.5 milliseconds (500), to help find out bugs in the memory subsystem.
-timeout period in pbdma_timeout_r() is set to init instead of max
This register contains a value used for detecting timeouts.
The timeout value is in microsecond ticks.
The timeouts that use this value are:
GPfifo fetch timeouts to FB for acks, reqs, rdats.
PBDMA connection to LB.
GPfifo processor timeouts to FB for acks, reqs, rdats.
Method processor timeouts to FB for acks, reqs, rdats.
The init value is changed to 64K us based on bug 1816557.
JIRA GPUT19X-74
JIRA GPUT19X-47
Change-Id: I6f818e129c3ea67571d206c5e735607cbfcf6ec6
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1325352
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
CTX_STATUS_SWITCH: Engine save hasn't started yet, continue to poll
CTX_STATUS_INVALID: The engine context has switched off. The
preemption step for this engine is complete.
CTX_STATUS_VALID or CTX_STATUS_CTXSW_SAVE: check the ID field:
* If ID matches the TSG for the context being torn down, the engine
reset procedure can be performed, or SW can continue
waiting for preempt to finish if id is not being torn down.
* If ID does NOT match, the context isn't running on the engine.
CTX_STATUS_LOAD: check the NEXT_ID field:
* If NEXT_ID matches the TSG of the context being torn down, the engine
is loading the context and reset can be performed
immediately or after a delay to allow the context a chance to load and
be saved off, or sw can continue waiting for preempt to finish if id
is not being torn down.
* If NEXT_ID does not match the TSG ID or CHID then the context is no
longer on the engine.
SW may alternatively wait for the CTX_STATUS to reach INVALID, but this
may take longer if an unrelated context is currently on the engine or
being switched to.
JIRA GPUT19X-7
Change-Id: I61499f932019de32e0200084c4c41b21a7cbbd2b
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1327164
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
Reviewed-by: Seshendra Gadagottu <sgadagottu@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>
Init device_fatal, channel_fatal and restartable fifo intr pbdma s/w
variables for pbdma_intr_0 interrupt masks.
pbdma_intr_0 field changes for gv11b:-
bit 8(lbreq) does not exists in hw.
bit 28 (syncpoint_illegal)is removed in hw.
bit 20 is reused for clear_faulted_error in hw.
bit 24 (eng_reset) and bit 25 (semaphore) always existed in hw
but never handled in s/w. These are added as channel fatal.
JIRA GPUT19X-47
Change-Id: I13673430408f1cf7ef762075a29b94196f79a349
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: http://git-master/r/1325401
Reviewed-by: svccoveritychecker <svccoveritychecker@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Terje Bergstrom <tbergstrom@nvidia.com>