video: tegra: nvmap: Add multithreaded cache flush support

On TOT, NvMap flushes the cache page by page: it takes the virtual
address of each page in the buffer and performs a cache flush on it
using dcache_by_line_op. This results in very poor performance for
larger buffers; ~70% of the time taken by NvRmMemHandleAllocAttr is
consumed by the cache flush.

Address this perf issue using a multithreaded cache flush:
- Use a threshold value of 32768 pages, which is derived from perf
  experiments and from discussion with CUDA about their use cases.
- When a cache flush request of >= 32768 pages is made, vmap the pages
  to map them into a contiguous VA space and create n kernel threads,
  where n is the number of online CPUs.
- Divide the above VA range among the threads; each thread does the
  cache flush on the VA range assigned to it.

This logic results in the following % improvement for alloc tests.

-----------------------------------
Buffer Size in MB | % improvement |
----------------------------------|
128               | 52            |
256               | 56            |
512               | 57            |
1024              | 58            |
1536              | 57            |
2048              | 58            |
2560              | 57            |
3072              | 58            |
3584              | 58            |
4096              | 58            |
4608              | 58            |
5120              | 58            |
-----------------------------------

Bug 4628529

Change-Id: I803ef5245ff9283fdc3afc497a6b642c97e89c06
Signed-off-by: Ketan Patil <ketanp@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/c/linux-nv-oot/+/3187871
Reviewed-by: Krishna Reddy <vdumpa@nvidia.com>
GVS: buildbot_gerritrpt <buildbot_gerritrpt@nvidia.com>
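
The threshold check and the division of the mapped range among threads, as described in the commit message, can be illustrated with a small userspace sketch. This is not the driver's code: `flush_slice`, `use_multithreaded_flush`, `slice_for_thread` and the 4 KiB `PAGE_SIZE_BYTES` are hypothetical names and values chosen for illustration, and giving the leftover pages to the first few threads is an assumption about how a remainder might be handled. Only `THRESHOLD_PAGES_CACHE_FLUSH` comes from the diff.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u              /* assumption: 4 KiB pages */
#define THRESHOLD_PAGES_CACHE_FLUSH 32768u /* value taken from the diff */

/* Hypothetical description of the sub-range one thread would flush. */
struct flush_slice {
	size_t offset; /* byte offset from the start of the contiguous VA range */
	size_t size;   /* number of bytes this thread flushes */
};

/* Nonzero when the request is large enough for the multithreaded path. */
static int use_multithreaded_flush(size_t nr_pages)
{
	return nr_pages >= THRESHOLD_PAGES_CACHE_FLUSH;
}

/*
 * Split nr_pages as evenly as possible among nr_threads (must be > 0).
 * Assumption: when the page count is not a multiple of nr_threads, the
 * first (nr_pages % nr_threads) threads each take one extra page.
 */
static struct flush_slice slice_for_thread(size_t nr_pages,
					   unsigned int nr_threads,
					   unsigned int idx)
{
	size_t base = nr_pages / nr_threads;
	size_t rem = nr_pages % nr_threads;
	size_t first_page = (size_t)idx * base + (idx < rem ? idx : rem);
	size_t page_count = base + (idx < rem ? 1 : 0);
	struct flush_slice s = {
		first_page * PAGE_SIZE_BYTES,
		page_count * PAGE_SIZE_BYTES,
	};

	return s;
}
```

Under these assumptions, a 32768-page request on 8 online CPUs would give each thread a 16 MiB slice; in the driver, each slice would presumably be recorded in the `va_start` and `size` fields of a `struct nvmap_cache_thread`.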
2025-12-25 10:42:21 +03:00 · 2024-08-05 04:09:13 +00:00
parent a57d56284d
commit 9feb2a4347
3 changed files with 114 additions and 11 deletions
--- a/drivers/video/tegra/nvmap/nvmap_alloc_int.h
+++ b/drivers/video/tegra/nvmap/nvmap_alloc_int.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
-/* SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+/* SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. */

 #ifndef __NVMAP_ALLOC_INT_H
 #define __NVMAP_ALLOC_INT_H
@@ -14,6 +14,19 @@
 #define NVMAP_PP_BIG_PAGE_SIZE           (0x10000)
 #endif /* CONFIG_ARM64_4K_PAGES */

+/*
+ * Indicate the threshold number of pages after which
+ * the multithreaded cache flush will be used.
+ */
+#define THRESHOLD_PAGES_CACHE_FLUSH 32768
+
+struct nvmap_cache_thread {
+	pid_t thread_id;
+	void *va_start;
+	size_t size;
+	struct task_struct *task;
+};
+
 struct dma_coherent_mem_replica {
 	void		*virt_base;
 	dma_addr_t	device_base;