Commit 657bd90c authored by Linus Torvalds

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this at runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address the
     underlying issue of missed preemption events. These are the initial
     fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthening the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...
parents 7b15c27e c5e6fc08
+7 −0
@@ -3903,6 +3903,13 @@
			Format: {"off"}
			Disable Hardware Transactional Memory

	preempt=	[KNL]
			Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
			none - Limited to cond_resched() calls
			voluntary - Limited to cond_resched() and might_sleep() calls
			full - Any section that isn't explicitly preempt disabled
			       can be preempted anytime.

	print-fatal-signals=
			[KNL] debug: print fatal signals

+169 −0


NOTE: all this assumes a linear relation between frequency and work capacity.
We know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA): each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contributes the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.
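
As an illustration only, the EWMA above can be sketched in floating point.
The helper names here (Y, ewma_sum, ewma) are made up for this example and
the kernel itself uses fixed-point arithmetic, so treat this as a rough model
rather than the actual implementation:

```python
# Floating-point sketch of the EWMA described above (illustrative only;
# the names below are hypothetical and not taken from kernel/sched/pelt.c).

Y = 0.5 ** (1 / 32)  # per-period decay factor, chosen so that Y**32 == 0.5


def ewma_sum(samples):
    """Return u_0 + u_1*y + u_2*y^2 + ..., where samples[0] is the most recent period."""
    return sum(u * Y ** i for i, u in enumerate(samples))


def ewma(samples):
    """Normalise by the sum a constant signal of 1 would give over the same history."""
    return ewma_sum(samples) / ewma_sum([1.0] * len(samples))
```

With a long enough history, a signal that was 1 for the last 32 periods and 0
before that comes out close to 0.5, and ewma(A) + ewma(B) matches ewma(A+B)
computed element-wise, which is the composability property mentioned above.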

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.

For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

	    f_cur
  r_dvfs := -----
            f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:

	   APERF
  f_cur := ----- * P0
	   MPERF

	     4C-turbo;	if available and turbo enabled
  f_max := { 1C-turbo;	if turbo enabled
	     P0;	otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
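
A small sketch of the ratio arithmetic above, assuming frequencies are given
as plain numbers in one unit. The function and parameter names are
hypothetical and only mirror the formulas in this section; r_cpu is taken as
an input rather than computed:

```python
# Sketch of the invariance ratios described above (hypothetical helper names).

def r_dvfs_simple(f_cur, f_max):
    """Software-controlled DVFS: r_dvfs := f_cur / f_max."""
    return f_cur / f_max


def r_dvfs_intel(aperf, mperf, p0, turbo_enabled, turbo_1c=None, turbo_4c=None):
    """Hardware-controlled DVFS, following the Intel formulas above."""
    f_cur = aperf / mperf * p0
    if turbo_enabled and turbo_4c is not None:
        f_max = turbo_4c          # 4C-turbo; if available and turbo enabled
    elif turbo_enabled:
        f_max = turbo_1c          # 1C-turbo; if turbo enabled
    else:
        f_max = p0                # P0; otherwise
    return min(1.0, f_cur / f_max)


def scale_delta(delta, r_dvfs, r_cpu):
    """Scale a time delta by r_tot = r_dvfs * r_cpu."""
    return delta * r_dvfs * r_cpu
```

For example, a 1024us period at half the maximum frequency on a CPU with
r_cpu = 0.5 would be accounted as scale_delta(1024, 0.5, 0.5), i.e. a quarter
of the period.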


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

  util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.c:util_est_dequeue()
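
A rough sketch of the filter and the runqueue sum described above. The
function names and the filter weight of 1/4 are assumptions made for this
example and are not taken from the text:

```python
# Sketch of UTIL_EST / UTIL_EST_FASTUP as described above (hypothetical names;
# the weight of 0.25 is an assumption made for illustration).

def util_est_update(ewma, running, weight=0.25, fastup=True):
    """Update a task's estimate on dequeue with its current 'running' value."""
    if fastup and running >= ewma:
        return running                       # FASTUP: jump up instantly
    return ewma + weight * (running - ewma)  # otherwise move gradually


def rq_util_est(tasks):
    """util_est := Sum_t max(t_running, t_util_est_ewma) over runnable tasks.

    tasks is an iterable of (running, util_est_ewma) pairs.
    """
    return sum(max(running, est) for running, est in tasks)
```

With fastup enabled an increase is tracked immediately, while a decrease only
pulls the estimate part of the way down, so a periodic task keeps most of its
estimate across sleeps.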


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:

             max( running, util_est );	if UTIL_EST
  u_cfs := { running;			otherwise

               clamp( u_cfs + u_rt , u_min, u_max );	if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;				otherwise

  u := u_clamp + u_irq + u_dl;		[approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )
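
The formulas above can be sketched as follows, with utilization values
expressed as fractions of capacity (0.0 to 1.0). The function and parameter
names are hypothetical, and like the formulas this is only an approximation;
it leaves out the IO-wait boost and rate-limiting:

```python
# Sketch of the desired-frequency computation above (hypothetical names).

def desired_frequency(running, util_est, u_rt, u_irq, u_dl, f_max,
                      u_min=None, u_max=None, use_util_est=True):
    # u_cfs: max(running, util_est) if UTIL_EST, running otherwise
    u_cfs = max(running, util_est) if use_util_est else running

    # u_clamp: clamp(u_cfs + u_rt, u_min, u_max) if UCLAMP_TASK
    u = u_cfs + u_rt
    if u_min is not None and u_max is not None:
        u = min(max(u, u_min), u_max)

    # u := u_clamp + u_irq + u_dl
    u += u_irq + u_dl

    # f_des := min(f_max, 1.25 u * f_max)
    return min(f_max, 1.25 * u * f_max)
```

The 1.25 factor means f_max is already requested at 80% utilization, leaving
some headroom before the CPU saturates.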

XXX IO-wait; when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive; this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

 - In low-load scenarios, where DVFS is most relevant, the 'running' numbers
   will closely reflect utilization.

 - In saturated scenarios task movement will cause some transient dips.
   Suppose we have a CPU saturated with 4 tasks; when we migrate a task
   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
   new CPU will gain 0.25. This is inevitable and time progression will
   correct this. XXX do we still guarantee f_max due to no idle-time?

 - Much of the above is about avoiding DVFS dips, and independent DVFS domains
   having to re-learn / ramp-up when load shifts.
+9 −0
@@ -1058,6 +1058,15 @@ config HAVE_STATIC_CALL_INLINE
	bool
	depends on HAVE_STATIC_CALL

config HAVE_PREEMPT_DYNAMIC
	bool
	depends on HAVE_STATIC_CALL
	depends on GENERIC_ENTRY
	help
	   Select this if the architecture supports boot time preempt setting
	   on top of static calls. It is strongly advised to support inline
	   static call to avoid any overhead.

config ARCH_WANT_LD_ORPHAN_WARN
	bool
	help
+1 −1
@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
#define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))

#define SCALE_PRIO(x, prio) \
	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
	max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)

/*
 * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
+1 −0
@@ -224,6 +224,7 @@ config X86
	select HAVE_STACK_VALIDATION		if X86_64
	select HAVE_STATIC_CALL
	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
	select HAVE_PREEMPT_DYNAMIC
	select HAVE_RSEQ
	select HAVE_SYSCALL_TRACEPOINTS
	select HAVE_UNSTABLE_SCHED_CLOCK