linux-nvgpu

mirror of git://nv-tegra.nvidia.com/linux-nvgpu.git synced 2025-12-24 10:34:43 +03:00

Author	SHA1	Message	Date
Tejal Kudav	b80b2bdab8	gpu: nvgpu: Add CE interrupt handling a. LAUNCH_ERR - Userspace error. - Triggered due to faulty launch. - Handle using recovery to reset CE engine and teardown the faulty channel. b. An INVALID_CONFIG - - Triggered when LCE is mapped to floorswept PCE. - On iGPU, we use the default PCE 2 LCE HW mapping. The default mapping can be read from NV_CE_PCE2LCE_CONFIG INIT value in CE refmanual. - NvGPU driver configures the mapping on dGPUs (currently only on Turing). - So, this interrupt can only be triggered if there is kernel or HW error - Recovery ( which is killing the context + engine reset) will not help resolve this error. - Trigger Quiesce as part of handling. c. A MTHD_BUFFER_FAULT - - NvGPU driver allocates fault buffers for all TSGs or contexts, maps them in BAR2 VA space and writes the VA into channel instance block. - Can be triggered only due to kernel bug - Recovery will not help, need quiesce d. FBUF_CRC_FAIL - Triggered when the CRC entry read from the method fault buffer does not match the computed CRC from the methods contained in the buffer. - This indicates memory corruption and is a fatal interrupt which at least requires the LCE to be reset before operations can start again, if not the entire GPU. - Better to quiesce on memory corruption CE Engine reset (via recovery) will not help. e. FBUF_MAGIC_CHK_FAIL - Triggered when the MAGIC_NUM entry read from the method fault buf does not match NV_CE_MTHD_BUFFER_GLOBAL_HDR_MAGIC_NUM_VAL - This indicates memory corruption and is a fatal interrupt - Better to quiesce on memory corruption f. STALLING_DEBUG - Only triggered with SW write for debug purposes - Debug interrupt, currently ignored Move launch error handling from GP10b to GV11b HAL as - 1. LAUNCHERR_REPORT errcode METHOD_BUFFER_ACCESS_FAULT is not defined on Pascal 2. We do not support GP10b on dev-main ToT JIRA NVGPU-8102 Change-Id: Idc84119bc23b5e85f3479fe62cc8720e98b627a5 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678893 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2022-03-14 17:12:14 -07:00
Tejal Kudav	3bfab5df3f	gpu: nvgpu: Disable fault mthd buf intrs on safety Below CE interrupts are disabled on safety build as fault and switch mechanism is not supported on safety: NV_CE_LCE_INTR_STATUS_MTHD_BUFFER_FAULT NV_CE_LCE_INTR_STATUS_FBUF_CRC_FAIL NV_CE_LCE_INTR_STATUS_FBUF_MAGIC_CHK_FAIL Bug 3548082 Change-Id: I400cd02a8c9888b7ef0d71bbc1f7d792b48e8227 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2679052 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2022-03-10 16:04:37 -08:00
Tejal Kudav	3fe70bf86e	gpu: nvgpu: Update CE Intr code as per Orin HSIs Below CE interrupts do not have any users(usecases) on safety build; disable them only on safety build. 1. BLOCKPIPE stall intr: Not used by GFX(VKSC) and CUDA on safety. 2. NONBLOCK_PIPE nonstall intr: Non-stall intrs are not supported on safety build. Also, this one is not used by GFX(VKSC) and CUDA. 3. STALLING_DEBUG intr: Added in Orin tree. It is only needed for debugging. Disable on safety build as there is no current usage in driver. 4. POISON_ERROR intr: Poison is a fault containment and not supported on GA10b. 5. INVALID_CONFIG intr: Floor sweeping not supported on functional safety SKU. Bug 3548082 Change-Id: I8d97ccb38f138b2c04a780e1c255a64d28723405 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2671927 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2022-03-08 11:41:26 -08:00
Rajesh Devaraj	0699220b85	gpu: nvgpu: compile-out unused apis from safety build This patch does the following changes: - Compiles-out unused error reporting APIs and the related data structures from safety build. For this purpose, it introduces the new flag: CONFIG_NVGPU_INTR_DEBUG - Updates nvgpu_report_err_to_sdl() API with one more argument, hw_unit_id. This aids in finding whether an error to be reported is corrected or uncorrected from LUT. - Triggers SW quiesce, if an uncorrected error is reported to Safety_Services, in safety build. - Renames files in cic folder by replacing gv11b with ga10b, since error reporting for gv11b is not supported in dev-main. JIRA NVGPU-8002 Change-Id: Ic01e73b0208252abba1f615a2c98d770cdf41ca4 Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2668466 Reviewed-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-02-14 22:00:33 -08:00
Rajesh Devaraj	878235e914	gpu: nvgpu: remove report error callback In DRIVE 6.0, NvGPU needs to support error reporting in QNX-Safety, QNX-Standard, and Linux. To support error reporting in all these platform variants, SDL unit will be moved from QNX to common code. As part of this refactoring activity, this patch removes ops assignment for report error. Also, it removes API calls that are used to take time-stamp for stall interrupt thread. This time-stamp APIs will be brought back later, if required to support periodic diagnostics. JIRA NVGPU-7353 Change-Id: I38536019dc7165e6a97674863b37d009854af948 Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2655958 Reviewed-by: svcacv <svcacv@nvidia.com> Reviewed-by: Shashank Singh <shashsingh@nvidia.com> Reviewed-by: Vijayakumar Subbu <vsubbu@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-01-24 02:06:24 -08:00
tkudav	0526e7eaa9	gpu: nvgpu: Create CIC-mon and CIC-rm subunits common.cic unit is divided into common.cic.mon and common.cic.rm based on rm and mon process split. CIC-mon subunit includes the code which is utilized in critical interrupt handling path like initialization, error detection and error reporting path. CIC-rm subunit includes the code corresponding to rest of interrupt handling(like collecting error debug data from registers) and ISR status management (status of deferred interrupts). Split the CIC APIs and data-members into above two subunits. JIRA NVGPU-6899 Change-Id: I151b59105ff570607c4a62e974785e9c1323ef69 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2551897 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-07-02 09:57:56 -07:00
Tejal Kudav	9f43914933	gpu: nvgpu: Move Intr handling common code to CIC CIC (Central Interrupt controller) will be responsible for the interrupt handling. common.cic unit is the placeholder for all interrupt related code. Move interrupt related defines and Public APIs present in common.mc to common.cic. Note: The common.mc interrupts related struct definitions are not moved as part of this patch. Adapt the code to use interrupt handling related defines and public APIs migrated from common.mc to common.cic JIRA NVGPU-6899 Change-Id: I747e2b556c0dd66d58d74ee5bb36768b9370d276 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2535618 Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: svc_kernel_abi <svc_kernel_abi@nvidia.com> Reviewed-by: Deepak Nibade <dnibade@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-05-31 19:37:31 -07:00
Tejal Kudav	e0a1fcf5f5	gpu: nvgpu: Add Central Intr Controller unit Add a new Central Interrupt Controller(CIC) unit in common code. The interrupt handling is done in a distributed manner currently. The error handling policy for different errors resides in each unit's ISR code. The goal is to converge this data under one central place - the CIC unit. This patch creates framework for CIC unit and moves the gv11b QNX safety LUT to CIC unit. All the error reporting APIs from different units are also moved to CIC. New APIs are exposed by CIC unit to access its internal data like: 1. Struct err_desc - the static err handling /injection data per error id 2. Num_hw_modules - the number of error reporting HW units supported by CIC Init and deinit of CIC unit: 1. CIC unit should be initialized earlyon during boot so that it is available for any interrupt handling. 2. Initialize CIC just before the interrupts are enabled during boot. 3. Similarly, CIC is disabled late during deinit cycle; right after the interrupts are masked. LUT: 1. LUT is currently used only for reporting error to safety services in gv11b QNX safety build. 2. This error handling policy LUT currently has only two levels of handing - correctable and quiecse. 3. Once, the error handling policy decision is moved from leaf unit nodes to CIC, LUT will be updated to have additional levels like fast recovery and full recovery. 4. Also, then a separate LUT will be added for each platform/build. 5. In current framework, the LUT is set to NULL for all configurations except gv11b. report_err() ops is added to report error to safety services. This ops is only effective for gv11b qnx build; and set to NULL for other configurations. NVGPU-6521 NVGPU-6523 NVGPU-6750 NVGPU-6758 NVGPU-6760 NVGPU-6754 Change-Id: I24be7836a96d787741e37b732e19863ed8014635 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2518683 Reviewed-by: Ajesh K V <akv@nvidia.com> Reviewed-by: Alex Waterman <alexw@nvidia.com> Reviewed-by: Deepak Nibade <dnibade@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-05-25 14:28:04 -07:00
Vedashree Vidwans	3b6eb25668	gpu: nvgpu: update common.ce doxygen comments - Update function documentation - Rename test_setup_env and test_free_env to test_ce_setup_env and test_ce_free_env respectively. This will make sure that ce test has independent setup and free functions. - Add doxygen comments for nonstall workqueue operations in mc. Jira NVGPU-6249 Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com> Change-Id: I451d7b77e9a4f60fa30a9de640d423199c1a206c Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2459666 (cherry picked from commit b367d3d0ea7f92b722aebdc8cf6018979fe74c47) Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2480584 Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: Alex Waterman <alexw@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-03-04 11:04:25 -08:00
Vedashree Vidwans	e0dd79cd43	gpu: nvgpu: rearch mc reset and enable hals Remove current mc hals - mc.reset() - mc.enable() - mc.disable() - mc.reset_mask() - mc.reset_engine() - mc.reset_engine_enable() Add new mc hals - mc.enable_units(g, units, enable) > enable/disable given unit(s) - mc.enable_dev(g, dev, enable) > enable/disable engine represented by given device pointer - mc.enable_devtype(g, devtype) > enable/disable all engines of given devtype Move common mc intr functions to common/mc/mc_intr.c. Add below common mc functions - nvgpu_mc_reset_units(g, units) > reset given logical OR of nvgpu unit bitmap - nvgpu_mc_reset_dev(g, dev) > reset given single engine via dev > if engine is graphics, reset gpcs for nvgpu_next - nvgpu_mc_reset_devtype(g, devtype) > reset all engines of given devtype > if devtype is graphics, reset gpcs for nvgpu_next Bug 200648985 Bug 3109773 Change-Id: Idc67a14a0a7cde83de44fbfbec13007fead3ed5c Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2408523 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:13:28 -06:00
Alex Waterman	0f501d806f	gpu: nvgpu: Unit test fixes and staging for device rework Fix up what unit tests can be easily fixed up. Stage everything else. In short the unit test code is _incredibly_ fragile since it's designed to hit every branch, positive and negative, in the code. However, the result of that is unit tests that are painful to modify. A lot of unit tests are also extremely opaque and rely on internal nvgpu behavior. This patch will be updated with fixes as I make them. Or, alternatively, it may be worth just temporarily disabling unit tests on dev-main. We'll have a _lot_ of work for Orin that will essentially gut the gr, host, and interrupt code. If we retain the unit test code for this, it may end up being backgreaking. JIRA NVGPU-5421 Change-Id: I8055fc72521f6a3a8a0d8f07fbe50c649a675016 Signed-off-by: Alex Waterman <alexw@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2347274 Reviewed-by: Konsta Holtta <kholtta@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> GVS: Gerrit_Virtual_Submit Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:13:28 -06:00
Alex Waterman	5f0fdf085c	nvgpu: unit: Add new mock register framework Many tests used various incarnations of the mock register framework. This was based on a dump of gv11b registers. Tests that greatly benefitted from having generally sane register values all rely heavily on this framework. However, every test essentially did their own thing. This was not efficient and has caused a some issues in cleaning up the device and host code. Therefore introduce a much leaner and simplified register framework. All unit tests now automatically get a good subset of the gv11b registers auto-populated. As part of this also populate the HAL with a nvgpu_detect_chip() call. Many tests can now _probably_ have all their HAL init (except dummy HAL stuff) deleted. But this does require a few fixups here and there to set HALs to NULL where tests expect HALs to be NULL by default. Where necessary HALs are cleared with a memset to prevent unwanted code from executing. Overall, this imposes a far smaller burden on tests to initialize their environments. Something to consider for the future, though, is how to handle supporting multiple chips in the unit test world. JIRA NVGPU-5422 Change-Id: Icf1a63f728e9c5671ee0fdb726c235ffbd2843e2 Signed-off-by: Alex Waterman <alexw@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2335334 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:13:28 -06:00
Philip Elcan	e195c05bcf	gpu: nvgpu: unit: ce: add missing target Add missing target for gp10b_ce_stall_isr since it is called directly by gv11b_ce_stall_isr. JIRA NVGPU-4818 Change-Id: If9bc17335333cb4cbec56e60168e0fa22663a22c Signed-off-by: Philip Elcan <pelcan@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2280639 Reviewed-by: Automatic_Commit_Validation_User Reviewed-by: Vinod Gopalakrishnakurup <vinodg@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit	2020-12-15 14:10:29 -06:00
Philip Elcan	9fc6bf3a4b	gpu: nvgpu: unit: ce: switch to gops Change the unit test to use gops instead of the HALs directly. Update the Targets SWUTS keyword to reflect the change. This allows linkage to test coverage in the SWUD. JIRA NVGPU-4818 Change-Id: Ib87b4cdb68112eda89ba8844677b46c67b5e6745 Signed-off-by: Philip Elcan <pelcan@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2279484 Reviewed-by: Automatic_Commit_Validation_User GVS: Gerrit_Virtual_Submit Reviewed-by: Thomas Fleury <tfleury@nvidia.com> Reviewed-by: Alex Waterman <alexw@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:10:29 -06:00
Nicolas Benech	b682091b13	gpu: nvgpu: SWUTS: clean up test types Apply the following changes to test types: * "Init" --> "Other (setup)" * "Coverage" --> Removed since it's implied for all tests * "Feature based" --> "Feature" * "Boundary Value analysis" and "Boundary values based" --> "Boundary values" * "Error guessing based" --> "Error guessing" JIRA NVGPU-3510 Change-Id: I3a9c0c59e6ad806f3479caa5e9a62f4d89f76923 Signed-off-by: Nicolas Benech <nbenech@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/2265670 Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:10:29 -06:00
Philip Elcan	76adb91f60	gpu: nvgpu: unit: add CE unit test Add unit test for the common.ce unit and the gv11b CE FUSA HALs. JIRA NVGPU-930 Change-Id: Idee75a1a5b53d397047edbead0db68ae999ce640 Signed-off-by: Philip Elcan <pelcan@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/2255473 Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2020-12-15 14:10:29 -06:00

16 Commits