Commit 3fe2f744 authored by Linus Torvalds

Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Cleanups for SCHED_DEADLINE

 - Tracing updates/fixes

 - CPU Accounting fixes

 - First wave of changes to optimize the overhead of the scheduler
   build, from the fast-headers tree - including placeholder *_api.h
   headers for later header split-ups.

 - Preempt-dynamic using static_branch() for ARM64

 - Isolation housekeeping mask rework; preparatory for further changes

 - NUMA-balancing: deal with CPU-less nodes

 - NUMA-balancing: tune systems that have multiple LLC cache domains per
   node (e.g. AMD)

 - Updates to RSEQ UAPI in preparation for glibc usage

 - Lots of RSEQ/selftests, for same

 - Add Suren as PSI co-maintainer

* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  sched/headers: ARM needs asm/paravirt_api_clock.h too
  sched/numa: Fix boot crash on arm64 systems
  headers/prep: Fix header to build standalone: <linux/psi.h>
  sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
  cgroup: Fix suspicious rcu_dereference_check() usage warning
  sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
  sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
  sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
  sched/deadline,rt: Remove unused functions for !CONFIG_SMP
  sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
  sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
  sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
  sched/deadline: Remove unused def_dl_bandwidth
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
  sched/cpuacct: Remove redundant RCU read lock
  sched/cpuacct: Optimize away RCU read lock
  sched/cpuacct: Fix charge percpu cpuusage
  sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
  ...
parents ebd326ce ffea9fb3
+1 −45
@@ -609,51 +609,7 @@ be migrated to a local memory node.
The unmapping of pages and trapping faults incur additional overhead that
ideally is offset by improved memory locality but there is no universal
guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the `numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.


numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
===============================================================================================================================


Automatic NUMA balancing scans a task's address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running.  Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases.  The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases.  The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a task's memory is migrated to a local node if the
workload pattern changes, which minimises the performance impact of remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.

``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
scan a task's virtual memory. It effectively controls the maximum scanning
rate for each task.

``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
when it initially forks.

``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
scan a task's virtual memory. It effectively controls the minimum scanning
rate for each task.

``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
scanned for a given scan.

feature should be disabled.

oops_all_cpu_backtrace
======================
+1 −0
@@ -18,6 +18,7 @@ Linux Scheduler
    sched-nice-design
    sched-rt-group
    sched-stats
    sched-debug

    text_files

+54 −0
=================
Scheduler debugfs
=================

Booting a kernel with CONFIG_SCHED_DEBUG=y will give access to
scheduler-specific debug files under /sys/kernel/debug/sched. Some of
those files are described below.

numa_balancing
==============

The `numa_balancing` directory holds files that control the NUMA
balancing feature.  If the system overhead from the feature is too
high then the rate at which the kernel samples for NUMA hinting
faults may be controlled by the `scan_period_min_ms, scan_delay_ms,
scan_period_max_ms, scan_size_mb` files.


scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb
-------------------------------------------------------------------

Automatic NUMA balancing scans a task's address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running.  Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases.  The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases.  The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a task's memory is migrated to a local node if the
workload pattern changes, which minimises the performance impact of remote
memory accesses. These files control the thresholds for scan delays and
the number of pages scanned.

``scan_period_min_ms`` is the minimum time in milliseconds to scan a
task's virtual memory. It effectively controls the maximum scanning
rate for each task.

``scan_delay_ms`` is the starting "scan delay" used for a task when it
initially forks.

``scan_period_max_ms`` is the maximum time in milliseconds to scan a
task's virtual memory. It effectively controls the minimum scanning
rate for each task.

``scan_size_mb`` is how many megabytes worth of pages are scanned for
a given scan.
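
Not part of the patch, but a rough illustration of how these files might be
driven from user space: the sketch below assumes CONFIG_SCHED_DEBUG=y,
debugfs mounted at /sys/kernel/debug, and root privileges for writes; the
read_knob()/write_knob() helpers are invented for this example.

/* Illustrative only: query and adjust the NUMA-balancing debugfs knobs. */
#include <stdio.h>

#define NB_DIR "/sys/kernel/debug/sched/numa_balancing/"

/* Read a single integer value from one of the numa_balancing files. */
static long read_knob(const char *name)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), NB_DIR "%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Write a new value to one of the files (needs root). */
static int write_knob(const char *name, long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), NB_DIR "%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	long min = read_knob("scan_period_min_ms");

	printf("scan_period_min_ms = %ld\n", min);
	/* Lower the maximum scan rate by doubling the minimum scan period. */
	if (min > 0)
		write_knob("scan_period_min_ms", min * 2);
	return 0;
}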
+1 −0
@@ -15566,6 +15566,7 @@ F: drivers/net/ppp/pptp.c
PRESSURE STALL INFORMATION (PSI)
M:	Johannes Weiner <hannes@cmpxchg.org>
M:	Suren Baghdasaryan <surenb@google.com>
S:	Maintained
F:	include/linux/psi*
F:	kernel/sched/psi.c
+33 −4
@@ -1293,12 +1293,41 @@ config HAVE_STATIC_CALL_INLINE

config HAVE_PREEMPT_DYNAMIC
	bool

config HAVE_PREEMPT_DYNAMIC_CALL
	bool
	depends on HAVE_STATIC_CALL
	depends on GENERIC_ENTRY
	select HAVE_PREEMPT_DYNAMIC
	help
	   An architecture should select this if it can handle the preemption
	   model being selected at boot time using static calls.

	   Where an architecture selects HAVE_STATIC_CALL_INLINE, any call to a
	   preemption function will be patched directly.

	   Where an architecture does not select HAVE_STATIC_CALL_INLINE, any
	   call to a preemption function will go through a trampoline, and the
	   trampoline will be patched.

	   It is strongly advised to support inline static call to avoid any
	   overhead.

config HAVE_PREEMPT_DYNAMIC_KEY
	bool
	depends on HAVE_ARCH_JUMP_LABEL && CC_HAS_ASM_GOTO
	select HAVE_PREEMPT_DYNAMIC
	help
	   Select this if the architecture support boot time preempt setting
	   on top of static calls. It is strongly advised to support inline
	   static call to avoid any overhead.
	   An architecture should select this if it can handle the preemption
	   model being selected at boot time using static keys.

	   Each preemption function will be given an early return based on a
	   static key. This should have slightly lower overhead than non-inline
	   static calls, as this effectively inlines each trampoline into the
	   start of its callee. This may avoid redundant work, and may
	   integrate better with CFI schemes.

	   This will have greater overhead than using inline static calls as
	   the call to the preemption function cannot be entirely elided.

config ARCH_WANT_LD_ORPHAN_WARN
	bool
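
Not part of the patch, but a rough sketch of the difference between the two
options above, using the kernel's existing static-call and static-key
primitives. The identifiers my_preempt_schedule, my_preempt_nop,
my_dynamic_preempt and sk_dynamic_my_preempt are invented for illustration;
the real PREEMPT_DYNAMIC wiring in kernel/sched differs in detail.

/* Kernel-internal sketch; compiles only inside a kernel tree. */
#include <linux/jump_label.h>
#include <linux/static_call.h>

static void my_preempt_schedule(void);
static void my_preempt_nop(void) { }

/*
 * HAVE_PREEMPT_DYNAMIC_CALL flavour: each call site goes through a
 * static call (a patched trampoline, or a directly patched call site
 * with HAVE_STATIC_CALL_INLINE).  Switching the preemption model
 * rewrites the target of the call.
 */
DEFINE_STATIC_CALL(my_dynamic_preempt, my_preempt_schedule);

static void call_site(void)
{
	static_call(my_dynamic_preempt)();	/* target patched at boot */
}

/*
 * HAVE_PREEMPT_DYNAMIC_KEY flavour: the function is always called, but
 * takes a static-key guarded early return when preemption is switched
 * off, so the call itself is never elided.
 */
DEFINE_STATIC_KEY_TRUE(sk_dynamic_my_preempt);

static void my_preempt_schedule(void)
{
	if (!static_branch_unlikely(&sk_dynamic_my_preempt))
		return;				/* e.g. preempt=none */
	/* ... actual preemption work ... */
}

/* Boot-time selection of a non-preemptible model might then do: */
static void select_none_model(void)
{
	static_call_update(my_dynamic_preempt, my_preempt_nop);	/* call variant */
	static_branch_disable(&sk_dynamic_my_preempt);		/* key variant */
}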