Commit 4ecfeff5 authored by Jiexun Wang's avatar Jiexun Wang Committed by Liu Shixin
Browse files

mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()

mainline inclusion
from mainline-v6.7-rc5
commit b2f557a21bc8fffdcd65794eda8a854e024999f3
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2f557a21bc8fffdcd65794eda8a854e024999f3

--------------------------------

I conducted real-time testing and observed that
madvise_cold_or_pageout_pte_range() causes significant latency under
memory pressure, which can be effectively reduced by adding cond_resched()
within the loop.

I tested on the LicheePi 4A board using Cylictest for latency testing and
Ftrace for latency tracing.  The board uses TH1520 processor and has a
memory size of 8GB.  The kernel version is 6.5.0 with the PREEMPT_RT patch
applied.

The script I tested is as follows:

echo wakeup_rt > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on
echo 0 > /sys/kernel/tracing/tracing_max_latency
stress-ng --vm 8 --vm-bytes 2G &
cyclictest --mlockall --smp --priority=99 --distance=0 --duration=30m
echo 0 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace

The tracing results before modification are as follows:

stress-n-206       3dn.h512    2us :      206:120:R   + [003]     196:  0:R cyclictest
stress-n-206       3dn.h512    7us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
stress-n-206       3dn.h512    9us#: 0
stress-n-206       3d...3.. 2544us : __schedule
stress-n-206       3d...3.. 2545us :      206:120:R ==> [003]     196:  0:R cyclictest
stress-n-206       3d...3.. 2551us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => rt_spin_unlock
 => madvise_cold_or_pageout_pte_range
 => walk_pgd_range
 => __walk_page_range
 => walk_page_range
 => madvise_pageout
 => madvise_vma_behavior
 => do_madvise
 => sys_madvise
 => do_trap_ecall_u
 => ret_from_exception

The tracing results after modification are as follows:

stress-n-232       0dn.h413    1us+:      232:120:R   + [000]     217:  0:R cyclictest
stress-n-232       0dn.h413   12us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
stress-n-232       0dn.h413   19us#: 0
stress-n-232       0d...3.. 1671us : __schedule
stress-n-232       0d...3.. 1676us+:      232:120:R ==> [000]     217:  0:R cyclictest
stress-n-232       0d...3.. 1687us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_batch_pages_flush
 => tlb_flush_mmu
 => unmap_page_range
 => unmap_vmas
 => unmap_region
 => do_vmi_align_munmap.constprop.0
 => do_vmi_munmap
 => __vm_munmap
 => sys_munmap
 => do_trap_ecall_u
 => ret_from_exception

After the modification, the cause of maximum latency is no longer
madvise_cold_or_pageout_pte_range(), so this modification can reduce the
latency caused by madvise_cold_or_pageout_pte_range().

Currently the madvise_cold_or_pageout_pte_range() function exhibits
significant latency under memory pressure, which can be effectively
reduced by adding cond_resched() within the loop.

When the batch_count reaches SWAP_CLUSTER_MAX, we reschedule
the task to ensure fairness and avoid long lock holding times.

Link: https://lkml.kernel.org/r/85363861af65fac66c7a98c251906afc0d9c8098.1695291046.git.wangjiexun@tinylab.org


Signed-off-by: default avatarJiexun Wang <wangjiexun@tinylab.org>
Cc: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
[ Dep-of: 3931b871c493 ("mm: madvise: avoid split during MADV_PAGEOUT
  and MADV_COLD"). ]
Signed-off-by: default avatarLiu Shixin <liushixin2@huawei.com>
parent bb6adb65
Loading
Loading
Loading
Loading
+11 −0
Original line number Diff line number Diff line
@@ -379,6 +379,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
	struct folio *folio = NULL;
	LIST_HEAD(folio_list);
	bool pageout_anon_only_filter;
	unsigned int batch_count = 0;

	if (fatal_signal_pending(current))
		return -EINTR;
@@ -460,6 +461,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
regular_folio:
#endif
	tlb_change_page_size(tlb, PAGE_SIZE);
restart:
	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	if (!start_pte)
		return 0;
@@ -468,6 +470,15 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);

		if (++batch_count == SWAP_CLUSTER_MAX) {
			batch_count = 0;
			if (need_resched()) {
				pte_unmap_unlock(start_pte, ptl);
				cond_resched();
				goto restart;
			}
		}

		if (pte_none(ptent))
			continue;