Account NvMap-allocated memory in both RSS and cgroup (CG) tracking so
that the OOM killer can make effective kill decisions under memory pressure.
NvMap allocates memory via kernel APIs such as alloc_pages, and this
kernel memory is not accounted on behalf of the process that requests
the allocation. Hence, in case of OOM, the OOM killer never kills the
process that allocated memory via NvMap, even though this process might
be holding most of the memory.
Solve this issue using the following approach:
- Use the __GFP_ACCOUNT and __GFP_NORETRY flags.
- __GFP_NORETRY prevents the current allocation flow from entering the
OOM path, so the allocation itself never triggers OOM.
- __GFP_ACCOUNT causes the allocation to be accounted to kmemcg, so any
allocation done by NvMap is accounted to kmemcg and cgroups can be
used to define memory limits.
- Add RSS counting for the process that allocates via NvMap, so that the
OOM score for that process is updated and the OOM killer can pick this
process based on the OOM score.
- Every process that holds a reference to an NvMap handle has the
memory size accounted in its RSS. On releasing the reference to the
handle, the RSS is reduced.
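The per-reference RSS accounting described in the last two bullets can be modelled in plain userspace C. This is only an illustrative sketch: `struct proc_model`, `struct handle_model`, `handle_ref_get()` and `handle_ref_put()` are hypothetical names, not NvMap or kernel APIs, and a real kernel implementation would adjust the process's RSS counters rather than a plain integer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a process; rss_bytes models its RSS counter. */
struct proc_model {
    size_t rss_bytes;
};

/* Hypothetical stand-in for an NvMap handle backed by 'size' bytes. */
struct handle_model {
    size_t size;
    unsigned int refs;   /* number of process references currently held */
};

/* A process takes a reference: charge the full handle size to its RSS. */
static void handle_ref_get(struct proc_model *p, struct handle_model *h)
{
    h->refs++;
    p->rss_bytes += h->size;
}

/* A process drops its reference: remove the same amount from its RSS. */
static void handle_ref_put(struct proc_model *p, struct handle_model *h)
{
    if (h->refs == 0 || p->rss_bytes < h->size)
        return;          /* unbalanced put; ignored in this sketch */
    h->refs--;
    p->rss_bytes -= h->size;
}
```

With this model, two processes sharing one handle each carry the full handle size in their RSS, which is what lets the OOM score reflect every process that keeps the memory alive.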
Bug 5222690
Change-Id: I3fa9b76ec9fc8d7f805111cb96e11e2ab1db42ce
Signed-off-by: Ketan Patil <ketanp@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3447072
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Krishna Reddy <vdumpa@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Ajay Nandakumar Mannargudi <anandakumarm@nvidia.com>
The camera-diagnostic driver is missing a release function for the IVC
channel device that it is creating. This causes the following WARNING to
be observed when probing the camera-diagnostic driver fails ...
------------[ cut here ]------------
Device 'camera-diag' does not have a release() function, it is broken
and must be fixed. See Documentation/core-api/kobject.rst.
WARNING: CPU: 8 PID: 756 at drivers/base/core.c:2517 device_release+0x88/0xa8
...
Call trace:
device_release+0x88/0xa8
kobject_put+0xac/0x150
put_device+0x14/0x34
__mod_of__camera_diag_of_match_device_table+0x602514/0x6033a0
[camera_diagnostics]
tegra_ivc_bus_boot_sync+0x204/0x28c [ivc_bus]
really_probe+0x150/0x2c8
__driver_probe_device+0x78/0x134
driver_probe_device+0x3c/0x164
__driver_attach+0x98/0x1c4
bus_for_each_dev+0x7c/0xf4
driver_attach+0x24/0x38
bus_add_driver+0xec/0x218
driver_register+0x5c/0x13c
tegra_ivc_driver_register+0x10/0x1c [ivc_bus]
init_module+0x18/0x1000
[camera_diagnostics]
do_one_initcall+0x58/0x318
do_init_module+0x58/0x1ec
load_module+0x1f04/0x2000
init_module_from_file+0x88/0xd4
__arm64_sys_finit_module+0x148/0x330
invoke_syscall+0x48/0x134
el0_svc_common.constprop.0+0x40/0xf0
do_el0_svc+0x1c/0x30
el0_svc+0x30/0xb8
el0t_64_sync_handler+0x130/0x13c
el0t_64_sync+0x194/0x198
---[ end trace 0000000000000000 ]---
Ideally, we would use the 'tegra_ivc_channel_release()' function as the
release function, but because the camera-diagnostic driver does not call
'tegra_ivc_channel_create()' to create the channel and does not actually
need to free any memory, simply define a new empty release function
instead.
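The effect of the missing callback can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct device_model`, `device_model_put()` and `camera_diag_channel_release()` are hypothetical stand-ins, not the kernel's `struct device` or the driver's actual function name.

```c
#include <stdio.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct device (hypothetical model). */
struct device_model {
    const char *name;
    void (*release)(struct device_model *dev);
};

/* Empty release: nothing was allocated for the device, so nothing to free. */
static void camera_diag_channel_release(struct device_model *dev)
{
    (void)dev;
}

/*
 * Models the check made when the last reference is dropped.
 * Returns 0 if a release callback ran, -1 if it was missing
 * (the case in which the kernel prints the WARNING).
 */
static int device_model_put(struct device_model *dev)
{
    if (!dev->release) {
        fprintf(stderr, "Device '%s' does not have a release() function\n",
                dev->name);
        return -1;
    }
    dev->release(dev);
    return 0;
}
```

An empty callback is enough here because the check only concerns the presence of a release function, not what it does.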
Bug 5489551
Change-Id: I6a7de9893af0167409af79599b61faf005047b0e
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3443095
(cherry picked from commit f75c0f228ca4f1a4c17266cb97be2db63b25a592)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3453753
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Brad Griffis <bgriffis@nvidia.com>
A function called hrtimer_setup() was added in Linux v6.13 with the
intent that it would eventually replace hrtimer_init(). In Linux v6.15
the hrtimer_init() function was removed.
Use conftest to call the appropriate API based on what is defined
in the kernel.
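The conftest-based selection can be sketched as a compile-time switch. The example below builds in userspace with stand-in types; `NV_HRTIMER_SETUP_PRESENT` is a hypothetical macro name (the real conftest symbol is not given here), and `struct hrtimer_model` and the `*_model` helpers only mimic the shape of the two kernel APIs: the newer one takes the callback as an argument, while the older one requires assigning it separately.

```c
#include <stddef.h>

/* Stand-in for the kernel timer type so the sketch builds outside the kernel. */
struct hrtimer_model {
    int (*function)(struct hrtimer_model *t);
    int clock_id;
    int mode;
};

#ifdef NV_HRTIMER_SETUP_PRESENT   /* hypothetical conftest macro */
/* Newer style: one call sets callback, clock and mode together. */
static void hrtimer_setup_model(struct hrtimer_model *t,
                                int (*fn)(struct hrtimer_model *),
                                int clock_id, int mode)
{
    t->function = fn;
    t->clock_id = clock_id;
    t->mode = mode;
}
#else
/* Older style: init sets clock and mode; the callback is assigned later. */
static void hrtimer_init_model(struct hrtimer_model *t, int clock_id, int mode)
{
    t->function = NULL;
    t->clock_id = clock_id;
    t->mode = mode;
}
#endif

/* Single wrapper used by driver code regardless of kernel version. */
static void nv_timer_setup(struct hrtimer_model *t,
                           int (*fn)(struct hrtimer_model *),
                           int clock_id, int mode)
{
#ifdef NV_HRTIMER_SETUP_PRESENT
    hrtimer_setup_model(t, fn, clock_id, mode);
#else
    hrtimer_init_model(t, clock_id, mode);
    t->function = fn;
#endif
}

/* Example expiry callback used only to exercise the wrapper. */
static int sample_expiry(struct hrtimer_model *t)
{
    (void)t;
    return 0;
}
```

Keeping the version check inside one wrapper means call sites stay identical on every supported kernel.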
Bug 5466808
Change-Id: I1c1c4e81c840a058d8c4c0b1616c87cb8a8a8beb
Signed-off-by: Brad Griffis <bgriffis@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3436079
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Revanth Kumar Uppala <ruppala@nvidia.com>
The 'no_llseek' definition was removed in Linux v6.12. Use
NV_NO_LLSEEK_PRESENT to check if it should be defined.
The 'remove' callback of the 'platform_driver' structure was updated in
Linux v6.11 to return void instead of int.
Update rtcpu-coe.c so that it properly handles the above cases.
Bug 5466808
Change-Id: I9306840f0b4a9e5a59a5c161ac3c58af2a70a4ed
Signed-off-by: Brad Griffis <bgriffis@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3436078
Reviewed-by: Igor Mitsyanko <imitsyanko@nvidia.com>
Reviewed-by: svcacv <svcacv@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Issue - When the WiFi operating channel is switched, the WiFi role
index and role bitmap at times show that a role is already assigned
for the channel context, which causes association to fail. A kernel
warning is shown when this occurs.
Fix - Update the driver to v126-10, which fixes this issue.
[ 57.590860] Call trace:
[ 57.590861] rtw_phl_chanctx_add+0x528/0x8f4 [rtl8852ce]
[ 57.590947] rtw_clear_is_accepted_status+0x4a4/0xbb8 [rtl8852ce]
[ 57.591033] cur_req_hdl+0x3c/0x4c [rtl8852ce]
[ 57.591118] msg_dispatch+0x2dc/0x3f8 [rtl8852ce]
[ 57.591204] dispr_thread_loop_hdl+0x270/0x2dc [rtl8852ce]
[ 57.591289] dispr_share_thread_loop_hdl+0x10/0x1c [rtl8852ce]
[ 57.591374] share_thread_hdl+0xb8/0x1a0 [rtl8852ce]
[ 57.591459] kthread+0x110/0x124
[ 57.591466] ret_from_fork+0x10/0x20
Bug 5440351
Bug 5442104
Change-Id: Ie78c70c1ea7a789351a2ba4ad445c4d0062281da
Signed-off-by: Shobek Attupurath <sattupurath@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3426784
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Ashutosh Jha <ajha@nvidia.com>
OpenRM maps the buffer with remap_pfn_range, and its user VA is then
passed to libnvrm_mem to create a handle from it. NvMap uses
get_user_pages to get the user pages from the VA, which fails for a
buffer mapped with remap_pfn_range. NvMap then falls back to the
follow_pfn or follow_pfnmap_start functions to obtain the pfn from the
VA, and obtains the page pointer from it. But because get_user_pages
fails, the corresponding error prints are generated even when they are
not required. Hence reduce the log level to debug to avoid this
unnecessary log spew.
Bug 5383624
Change-Id: Idbd3cfe93ce3fac6e27efc5c3bb7a200fc183d26
Signed-off-by: Ketan Patil <ketanp@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3425552
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Pritesh Raithatha <praithatha@nvidia.com>
Problem:
- When a Downstream Port Containment (DPC) software trigger is issued, the LTR_EN bit in the Root Port (RP) is cleared as per the PCIe spec.
- However, the LTR_EN bit of the RTL8126 endpoint (EP), which is expected to be reset, remains set and the EP keeps sending Latency Tolerance Reporting (LTR) messages to the RP.
- This behavior violates the PCIe spec, as LTR_EN is a non-sticky bit and should be cleared automatically on reset.
- As the RP has LTR disabled but the EP still sends LTR messages, this results in Unsupported Request (UR) errors on the RP.
- These UR errors trigger AER (Advanced Error Reporting) recovery, which includes a Secondary Bus Reset (SBR).
- The SBR causes the PCIe link to go down and come back up, but the EP again starts sending LTR messages, leading to an infinite error-recovery loop.
Workaround:
- As a temporary fix, disable the LTR_EN bit in the RTL8126 EP during its probe.
- This prevents the EP from sending LTR messages, thereby avoiding UR errors and breaking the loop of AER recovery.
Impact:
- Disabling LTR prevents the EP from entering the L1.2 low power state.
- However, ASPM is currently not enabled in the system, so this workaround has no impact.
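The workaround amounts to clearing one bit in the EP's PCIe Device Control 2 register. The sketch below models that on a plain 16-bit value in userspace; the helper names are hypothetical, and 0x0400 is the value of `PCI_EXP_DEVCTL2_LTR_EN` in the Linux PCI register definitions. In a kernel driver, `pcie_capability_clear_word()` is the usual helper for this kind of update, though whether this change uses it is an assumption.

```c
#include <stdint.h>

/* Value of PCI_EXP_DEVCTL2_LTR_EN in the Linux PCI register definitions. */
#define DEVCTL2_LTR_EN 0x0400u

/* Returns the Device Control 2 value with the LTR enable bit cleared. */
static uint16_t devctl2_disable_ltr(uint16_t devctl2)
{
    return (uint16_t)(devctl2 & ~DEVCTL2_LTR_EN);
}

/* Returns non-zero if the LTR mechanism is enabled in the given value. */
static int devctl2_ltr_enabled(uint16_t devctl2)
{
    return (devctl2 & DEVCTL2_LTR_EN) != 0;
}
```

All other bits of the register are preserved, so the change affects only LTR message generation.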
Bug 4869463
Change-Id: Ibf7effaeb0f22e952645ef7bf6a18287264e1463
Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3420019
Reviewed-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
Reviewed-by: Ashutosh Jha <ajha@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
In unexpected fringe cases, HSB (Holoscan Sensor Bridge) sends an image
byte offset larger than the allocated image size (e.g. if HSB sends an
incorrect packet, is configured incorrectly for a different image
size, or a packet is corrupted).
In such cases, we run into SMMU faults.
To mitigate this, a buffer size of two check was introduced so that even
if this were to happen, it would not cause SMMU errors.
However, the support for this in UMD is not complete.
Therefore, disable this check until UMD is able to comply with this
buffer constraint.
Jira L4T-7463
Change-Id: I2de31740284627ca117f1fa0a28bde2ef9a82785
Signed-off-by: Rakibul Hassan <rakibulh@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3419644
Reviewed-by: Igor Mitsyanko <imitsyanko@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Narendra Kondapalli <nkondapalli@nvidia.com>
Reviewed-by: svcacv <svcacv@nvidia.com>
Modify the CoE capture logic to make it more robust and error-proof:
- The RCE Rx queue size limit is 16, so there is no point in having a
32-element queue in the kernel.
- Pass the kernel's queue length to RCE when opening a channel so that it
can be validated (to not exceed the RCE max depth).
- Validate image buffer IOVA addresses and buffer lengths before queuing
to RCE.
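The checks listed above can be sketched as two small predicates. This is a userspace illustration only: `struct iova_window`, `queue_depth_valid()` and `image_buffer_valid()` are hypothetical names, and the idea that buffers must fall inside a single mapped IOVA window is an assumption made for the example, not a statement about the driver.

```c
#include <stdint.h>
#include <stddef.h>

#define RCE_RX_QUEUE_MAX 16u   /* RCE Rx queue limit stated above */

/* Hypothetical description of the IOVA window mapped for image buffers. */
struct iova_window {
    uint64_t base;
    uint64_t size;
};

/* Queue depth passed at channel open must not exceed the RCE maximum. */
static int queue_depth_valid(uint32_t depth)
{
    return depth > 0 && depth <= RCE_RX_QUEUE_MAX;
}

/* A buffer is accepted only if [iova, iova + len) lies inside the window. */
static int image_buffer_valid(const struct iova_window *w,
                              uint64_t iova, uint64_t len)
{
    if (len == 0 || len > w->size)
        return 0;
    if (iova < w->base)
        return 0;
    /* Written this way to avoid overflow in iova + len. */
    return (iova - w->base) <= (w->size - len);
}
```

Rejecting a bad buffer before it is queued keeps the failure in the kernel, where it can be reported, instead of surfacing later as a fault on the RCE side.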
Jira CT26X-1892
Change-Id: I199143fe726ebab05a1236d4b14b59f0528d65a8
Signed-off-by: Igor Mitsyanko <imitsyanko@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3419638
Reviewed-by: svcacv <svcacv@nvidia.com>
Tested-by: Raki Hassan <rakibulh@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Narendra Kondapalli <nkondapalli@nvidia.com>
Upstream commit aa7a9275ab81 ("PM: sleep: Suspend async parents after
suspending children") triggers a suspend issue on Tegra234 Jetson
Orin Nano boards because it reorders the suspend of devices that have
async suspend enabled with respect to some other devices. This commit
is present in Linux v6.16 kernels.
The same issue was observed with the cypd4226 Type-C controller used on
other Jetson platforms, where, due to its dependencies on other devices,
it was necessary to disable async suspend to fix the issue [0]. Fix
suspend for Tegra234 Jetson Orin Nano platforms by disabling 'async'
suspend for the fusb301 device. Note that it is safe to disable this for
all kernel versions.
[0] https://lore.kernel.org/lkml/6180608.lOV4Wx5bFT@rjwysocki.net/
JIRA LINQPJ14-73
Change-Id: If08932406c43bca2736164a2fdd96a5a4b9fa81c
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3404885
(cherry picked from commit 21686177a6d395701cc8f19088090142657899a0)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3411825
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Brad Griffis <bgriffis@nvidia.com>
This avoids a kernel hang when the DCE FW fails to respond.
A failed IPC call returns -ERESTARTSYS or -ETIMEDOUT, which is handled
by the calling functions:
1. tegra_dce_client_ipc_send_recv (EXPORT_SYMBOL)
This is an exported module symbol, and the caller is responsible for
checking the return value.
2. DCE FSM event handler
An error return changes the FSM back to its previous state.
DCE_IPC_TIMEOUT_MS_MAX is set to 10000 ms.
The SHA computation time on an SC7 entry request can go up to 2 seconds,
and the host tolerance time must be larger than this.
Jira TDS-16567
https://nvbugspro.nvidia.com/bug/5335034
Change-Id: I5d77a9497f14f305d07b98e39a58fbcecafedf92
Signed-off-by: charliej <charliej@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3358620
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Mahesh Kumar <mahkumar@nvidia.com>
Reviewed-by: svcacv <svcacv@nvidia.com>
Tested-by: Mahesh Kumar <mahkumar@nvidia.com>
Reviewed-by: Vinod Gopalakrishnakurup <vinodg@nvidia.com>
(cherry picked from commit 6c2ab3c78ce7cba0e88455b263d51d1a88c03927)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3402917
This change adds support for programming stream IDs to
allow the tsec FW on t264 to access PA at a low privilege level.
It also includes the synchronization logic to tell the FW that
stream ID programming is complete, so that the FW can go ahead
and initialize itself.
In addition, the mailbox used for communicating init done
from the tsec FW to CCPLEX is changed from NV_PTSEC_FALCON_MAILBOX0 to
NV_PTSEC_MAILBOX1, since CCPLEX does not have access to the former from
t26x onwards. Hence falcon-based mailboxes are used for tsec-psc comms
and non-falcon ones for tsec-ccplex comms (stream ID comms and init done).
Jira TSEC-14
Change-Id: I2871a52222cd69786a8cc3f53162a80486611bb5
Signed-off-by: Sahil Patki <spatki@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3366343
Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
(cherry picked from commit db54fde9c4d786b22b7f8694753de3ec80649b17)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3400219
When switching the governor to nvhost_podgov or switching back to other
governors, we need to hold the devfreq lock to prevent a DVFS cycle
from being triggered from other paths.
The nvhost_pod_target_freq callback is called when a DVFS cycle is
triggered. However, the callback expects the governor data to be already
allocated and initialized. We need to synchronize the operations when we
switch the governor so that a DVFS cycle can only be triggered when the
governor data is ready.
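The synchronization requirement can be modelled in userspace with a mutex standing in for the devfreq lock. This is a simplified sketch: `struct devfreq_model`, `governor_start()`, `governor_stop()` and `target_freq()` are hypothetical names, and the explicit readiness check inside `target_freq()` is added only so the model can show the outcome; the text above describes the real callback as assuming the data is already present.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a devfreq device and its governor private data. */
struct devfreq_model {
    pthread_mutex_t lock;     /* stands in for the devfreq lock */
    unsigned long *gov_data;  /* NULL until the governor is ready */
};

static unsigned long podgov_state = 1000; /* example governor data */

/* Governor start: publish governor data while holding the lock. */
static void governor_start(struct devfreq_model *df)
{
    pthread_mutex_lock(&df->lock);
    df->gov_data = &podgov_state;
    pthread_mutex_unlock(&df->lock);
}

/* Governor stop: withdraw governor data while holding the lock. */
static void governor_stop(struct devfreq_model *df)
{
    pthread_mutex_lock(&df->lock);
    df->gov_data = NULL;
    pthread_mutex_unlock(&df->lock);
}

/* DVFS cycle: returns 0 and a frequency only if governor data is ready. */
static int target_freq(struct devfreq_model *df, unsigned long *freq)
{
    int ret = -1;

    pthread_mutex_lock(&df->lock);
    if (df->gov_data) {
        *freq = *df->gov_data;
        ret = 0;
    }
    pthread_mutex_unlock(&df->lock);
    return ret;
}
```

Because the switch and the DVFS path take the same lock, a cycle can never observe governor data that is only partly set up or already torn down.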
Bug 5354161
Bug 5351714
Change-Id: Iaf8af8291ea09a7c2bfbdc5e1453bb976ee0987b
Signed-off-by: Johnny Liu <johnliu@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3392341
Reviewed-by: svcacv <svcacv@nvidia.com>
Reviewed-by: Bibek Basu <bbasu@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
Reviewed-by: Rajesh Devaraj <rdevaraj@nvidia.com>
Reviewed-by: Rajkumar Kasirajan <rkasirajan@nvidia.com>
Android builds don't have CONFIG_NUMA enabled, hence
/sys/devices/system/node/node0/meminfo is not present on Android.
When nvscibuf calls QueryHeapParams to check for the presence of the
hugetlbfs-based carveout, error prints are seen due to the absence
of the above sysfs file. Hence first check whether there are multiple
NUMA nodes or not. If not, use the /proc/meminfo file to retrieve the
hugetlbfs size; otherwise use the meminfo sysfs node of the
corresponding NUMA node.
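The file selection and the lookup of a meminfo field can be sketched in userspace C. The helper names `meminfo_path()` and `meminfo_value()` are hypothetical, and which field the driver actually reads is not stated here; `HugePages_Total` is used below purely as an example key. The parser accepts both the `/proc/meminfo` layout and the per-node layout, whose lines start with a `Node <id> ` prefix.

```c
#include <stdio.h>
#include <string.h>

/* Choose the meminfo source: per-node sysfs file only with multiple nodes. */
static void meminfo_path(int numa_nodes, int node, char *buf, size_t len)
{
    if (numa_nodes > 1)
        snprintf(buf, len, "/sys/devices/system/node/node%d/meminfo", node);
    else
        snprintf(buf, len, "/proc/meminfo");
}

/*
 * Returns the number following "<key>:" on the first matching line,
 * or -1 if the key is not found. A leading "Node <id> " prefix, as
 * used in per-node meminfo files, is skipped when present.
 */
static long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        const char *p = line;
        int node, consumed = 0;
        long val;

        if (sscanf(p, "Node %d %n", &node, &consumed) == 1)
            p += consumed;
        if (strncmp(p, key, klen) == 0 && p[klen] == ':' &&
            sscanf(p + klen + 1, "%ld", &val) == 1)
            return val;

        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

Handling both layouts in one parser keeps the caller's logic limited to picking the path.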
Bug 5200644
Change-Id: I5495de91726d323210807e86f22757b798226fca
Signed-off-by: Ketan Patil <ketanp@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3338255
Reviewed-by: Pritesh Raithatha <praithatha@nvidia.com>
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: svcacv <svcacv@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Reviewed-by: Jian-Min Liu <jianminl@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>