linux-nvgpu

mirror of git://nv-tegra.nvidia.com/linux-nvgpu.git synced 2025-12-22 17:36:20 +03:00

Author	SHA1	Message	Date
Prasun Kumar	4194c35e17	gpu: nvgpu: Err injection utility support Update callback registration with error inject utility. Add callback de-registration with error injection utility when CIC_MON is removed. Bug 3413214 Signed-off-by: Prasun Kumar <prasunk@nvidia.com> Change-Id: Iab682cd522a96fd6af136485c4f3b73f81f723b8 Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2755178 Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com> Reviewed-by: svcacv <svcacv@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-08-17 19:22:17 -07:00
Tejal Kudav	494dc19ee8	gpu: nvgpu: Err injection utility support The HSI error injection utility is an on-bench debug and test utility which can be used by customers and SQA to test end-to-end error detection and reporting path. Inplement callback function to integrate with this utility and allow injecting GPU HSI related errors. As part of callback function hsierrrpt_inj(), invoke the driver's error-reporting logic which uses the EPD MISC_EC APIs. In future, we can enhance the callback function to trigger driver's error handling logic incrementally for different errors. Bug 3413214 Change-Id: I2d050b6c850d6151b40095f243a6733b4ba74f47 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2727198 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2022-07-01 08:11:45 -07:00
Rajesh Devaraj	fac998940c	gpu: nvgpu: enable polling support for error reporting in AV+L As per Safety_Services, a client must perform polling to ensure that the previously reported errors are cleared at FSI, in case of back-to-back error reporting. However, to minimize the polling overhead, NvGPU driver performs polling only when the error to be reported is corrected error to ensure that it is not overwriting the previously reported uncorrected/corrected error. In case of uncorrected errors, it will be reported without doing polling. This situation leads to a failure in error reporting, when uncorrected errors are reported back-to-back. This is acceptable for safety builds where SW quiesce will be triggered immediately after the reporting of first uncorrected error. In case of other build configurations, MCU/SEH takes the decision on encountering uncorrected errors. To handle such build configurations, polling is enabled for all types of errors, in all build configurations. This patch also removes an unused macro "ERR_TYPE_MASK". Bug 3622420 Change-Id: I750b0406faec9b229d8d0c74e986807234362cb9 Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2707105 Reviewed-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-05-06 05:21:43 -07:00
Rajesh Devaraj	a06b5ff2d2	gpu: nvgpu: update error reporting in av+l MISC_EC is not supported in platforms like L4T. In such case, -ENODEV error code will be returned by Safety_Services. This patch updates error reporting to consider this scenario. JIRA NVGPU-8094 Bug 3366818 Bug 3491596 Change-Id: I0316571fe44418e738ae784da0584cf1040cb6cb Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2684283 Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: Dinesh T <dt@nvidia.com> Reviewed-by: Ankur Kishore <ankkishore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-03-23 14:02:00 -07:00
Rajesh Devaraj	4652f96a6f	gpu: nvgpu: add polling for back-to-back error reporting in av+l When an error is reported to Safety_Services, it will be cleared at FSI and reported to SEH (System Error Handler). Since MISC_EC interface provides only one register for error reporting, there is a need to poll the status of previously reported error before reporting the next error. For this purpose, this patch adds logic to perform polling using epl_get_misc_ec_err_status(), in AV+L. JIRA NVGPU-8094 Bug 200729736 Change-Id: Ia01a2fc42a7ce586b7965a82c90027a9a2dd252b Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2684141 Reviewed-by: Dinesh T <dt@nvidia.com> Reviewed-by: Ankur Kishore <ankkishore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-03-21 10:51:39 -07:00
Rajesh Devaraj	86be5112b2	gpu: nvgpu: plug-in misc-ec interface for av+l This patch adds misc-ec interface into NvGPU driver to report GPU HW errors to Safety_Services, in AV+L. For this purpose, it introduces a new flag "CONFIG_NVGPU_ENABLE_MISC_EC". JIRA NVGPU-8094 Change-Id: Id8fff69487cad9ed4eb082a7d8615a1e15867ffa Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2678394 Reviewed-by: svcacv <svcacv@nvidia.com> Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: Dinesh T <dt@nvidia.com> Reviewed-by: Ankur Kishore <ankkishore@nvidia.com>	2022-03-18 07:54:53 -07:00
Rajesh Devaraj	0699220b85	gpu: nvgpu: compile-out unused apis from safety build This patch does the following changes: - Compiles-out unused error reporting APIs and the related data structures from safety build. For this purpose, it introduces the new flag: CONFIG_NVGPU_INTR_DEBUG - Updates nvgpu_report_err_to_sdl() API with one more argument, hw_unit_id. This aids in finding whether an error to be reported is corrected or uncorrected from LUT. - Triggers SW quiesce, if an uncorrected error is reported to Safety_Services, in safety build. - Renames files in cic folder by replacing gv11b with ga10b, since error reporting for gv11b is not supported in dev-main. JIRA NVGPU-8002 Change-Id: Ic01e73b0208252abba1f615a2c98d770cdf41ca4 Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2668466 Reviewed-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-by: svc-mobile-coverity <svc-mobile-coverity@nvidia.com> Reviewed-by: svc-mobile-misra <svc-mobile-misra@nvidia.com> Reviewed-by: svc-mobile-cert <svc-mobile-cert@nvidia.com> Reviewed-by: Vaibhav Kachore <vkachore@nvidia.com> GVS: Gerrit_Virtual_Submit	2022-02-14 22:00:33 -08:00
Rajesh Devaraj	7dc013d242	gpu: nvgpu: merge error reporting apis In DRIVE 6.0, NvGPU is allowed to report only 32-bit metadata to Safety_Services. So, there is no need to have distinct APIs for reporting errors from units like GR, MM, FIFO to SDL unit. All these error reporting APIs will be replaced with a single API. To meet this objective, this patch does the following changes: - Replaces nvgpu_report__err with nvgpu_report_err_to_sdl. - Removes the reporting of error messages. - Replaces nvgpu_log() with nvgpu_err(), for error reporting. - Removes error reporting to Safety_Services from nvgpu_report__err. However, nvgpu_report_*_err APIs and their related files are not removed. During the creation of nvgpu-mon, they will be moved under nvgpu-rm, in debug builds. Note: - There will be a follow-up patch to fix error IDs. - As discussed in https://nvbugs/3491596 (comment #12), the high level expectation is to report only errors. JIRA NVGPU-7450 Change-Id: I428f2a9043086462754ac36a15edf6094985316f Signed-off-by: Rajesh Devaraj <rdevaraj@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2662590 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2022-02-09 00:41:18 -08:00
tkudav	0526e7eaa9	gpu: nvgpu: Create CIC-mon and CIC-rm subunits common.cic unit is divided into common.cic.mon and common.cic.rm based on rm and mon process split. CIC-mon subunit includes the code which is utilized in critical interrupt handling path like initialization, error detection and error reporting path. CIC-rm subunit includes the code corresponding to rest of interrupt handling(like collecting error debug data from registers) and ISR status management (status of deferred interrupts). Split the CIC APIs and data-members into above two subunits. JIRA NVGPU-6899 Change-Id: I151b59105ff570607c4a62e974785e9c1323ef69 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2551897 Tested-by: mobile promotions <svcmobile_promotions@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-07-02 09:57:56 -07:00
Tejal Kudav	e0a1fcf5f5	gpu: nvgpu: Add Central Intr Controller unit Add a new Central Interrupt Controller(CIC) unit in common code. The interrupt handling is done in a distributed manner currently. The error handling policy for different errors resides in each unit's ISR code. The goal is to converge this data under one central place - the CIC unit. This patch creates framework for CIC unit and moves the gv11b QNX safety LUT to CIC unit. All the error reporting APIs from different units are also moved to CIC. New APIs are exposed by CIC unit to access its internal data like: 1. Struct err_desc - the static err handling /injection data per error id 2. Num_hw_modules - the number of error reporting HW units supported by CIC Init and deinit of CIC unit: 1. CIC unit should be initialized earlyon during boot so that it is available for any interrupt handling. 2. Initialize CIC just before the interrupts are enabled during boot. 3. Similarly, CIC is disabled late during deinit cycle; right after the interrupts are masked. LUT: 1. LUT is currently used only for reporting error to safety services in gv11b QNX safety build. 2. This error handling policy LUT currently has only two levels of handing - correctable and quiecse. 3. Once, the error handling policy decision is moved from leaf unit nodes to CIC, LUT will be updated to have additional levels like fast recovery and full recovery. 4. Also, then a separate LUT will be added for each platform/build. 5. In current framework, the LUT is set to NULL for all configurations except gv11b. report_err() ops is added to report error to safety services. This ops is only effective for gv11b qnx build; and set to NULL for other configurations. NVGPU-6521 NVGPU-6523 NVGPU-6750 NVGPU-6758 NVGPU-6760 NVGPU-6754 Change-Id: I24be7836a96d787741e37b732e19863ed8014635 Signed-off-by: Tejal Kudav <tkudav@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/c/linux-nvgpu/+/2518683 Reviewed-by: Ajesh K V <akv@nvidia.com> Reviewed-by: Alex Waterman <alexw@nvidia.com> Reviewed-by: Deepak Nibade <dnibade@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> GVS: Gerrit_Virtual_Submit Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>	2021-05-25 14:28:04 -07:00

10 Commits