Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (657bd90c) · Commits · EulixOS / Software / Kernel

Documentation/admin-guide/kernel-parameters.txt

+7 −0

Original line number	Diff line number	Diff line
		@@ -3903,6 +3903,13 @@
		Format: {"off"}
		Disable Hardware Transactional Memory

		preempt= [KNL]
		Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
		none - Limited to cond_resched() calls
		voluntary - Limited to cond_resched() and might_sleep() calls
		full - Any section that isn't explicitly preempt disabled
		can be preempted anytime.
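
As an illustration of the three accepted values, the following sketch maps a
preempt= string to a mode. The enum and the function name parse_preempt_mode()
are hypothetical and chosen for this example only; this is not the kernel's
implementation.

```c
#include <string.h>

/* Hypothetical names for illustration; not taken from the kernel source. */
enum preempt_mode {
	PREEMPT_MODE_NONE,	/* limited to cond_resched() calls */
	PREEMPT_MODE_VOLUNTARY,	/* cond_resched() and might_sleep() calls */
	PREEMPT_MODE_FULL,	/* anything not explicitly preempt disabled */
	PREEMPT_MODE_INVALID,	/* unrecognised string */
};

/* Map the value given after "preempt=" to one of the documented modes. */
static enum preempt_mode parse_preempt_mode(const char *str)
{
	if (!strcmp(str, "none"))
		return PREEMPT_MODE_NONE;
	if (!strcmp(str, "voluntary"))
		return PREEMPT_MODE_VOLUNTARY;
	if (!strcmp(str, "full"))
		return PREEMPT_MODE_FULL;
	return PREEMPT_MODE_INVALID;
}
```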

		print-fatal-signals=
		[KNL] debug: print fatal signals

Documentation/scheduler/schedutil.txt

0 → 100644

+169 −0



		NOTE: all this assumes a linear relation between frequency and work capacity;
		we know this is flawed, but it is the best workable approximation.


		PELT (Per Entity Load Tracking)
		-------------------------------

		With PELT we track some metrics across the various scheduler entities, from
		individual tasks to task-group slices to CPU runqueues. As the basis for this
		we use an Exponentially Weighted Moving Average (EWMA): each period (1024us)
		is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
		half, while the rest of history contributes the other half.

		Specifically:

		ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

		ewma(u) = ewma_sum(u) / ewma_sum(1)

		Since this is essentially a progression of an infinite geometric series, the
		results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
		is key, since it gives the ability to recompose the averages when tasks move
		around.
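
The composability property can be checked numerically. The sketch below is a
floating-point illustration over a finite window, not the fixed-point code in
kernel/sched/pelt.c; the constant PELT_Y and both function bodies are
assumptions made for this example, with u[0] taken as the most recent period.

```c
#include <stddef.h>

/* y such that y^32 = 0.5, i.e. 0.5^(1/32), rounded to 8 decimal places. */
#define PELT_Y 0.97857206

/* ewma_sum(u) = u[0] + u[1]*y + u[2]*y^2 + ... over n periods. */
static double ewma_sum(const double *u, size_t n)
{
	double sum = 0.0, w = 1.0;
	size_t i;

	for (i = 0; i < n; i++) {
		sum += u[i] * w;
		w *= PELT_Y;	/* each older period is decayed once more */
	}
	return sum;
}

/* ewma(u) = ewma_sum(u) / ewma_sum(1), using the same finite window. */
static double ewma(const double *u, size_t n)
{
	double norm = 0.0, w = 1.0;
	size_t i;

	for (i = 0; i < n; i++) {
		norm += w;
		w *= PELT_Y;
	}
	return n ? ewma_sum(u, n) / norm : 0.0;
}
```

Because both functions are linear in u, ewma(A) + ewma(B) equals ewma(A+B) up
to rounding, which is what allows averages to be subtracted from one runqueue
and added to another when a task migrates.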

		Note that blocked tasks still contribute to the aggregates (task-group slices
		and CPU runqueues), which reflects their expected contribution when they
		resume running.

		Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
		reflects the time an entity spends on the CPU, while 'runnable' reflects the
		time an entity spends on the runqueue. When there is only a single task these
		two metrics are the same, but once there is contention for the CPU 'running'
		will decrease to reflect the fraction of time each task spends on the CPU
		while 'runnable' will increase to reflect the amount of contention.

		For more detail see: kernel/sched/pelt.c


		Frequency- / CPU Invariance
		---------------------------

		Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
		for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
		a big CPU, we allow architectures to scale the time delta with two ratios, one
		Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

		For simple DVFS architectures (where software is in full control) we trivially
		compute the ratio as:

		r_dvfs := f_cur / f_max

		For more dynamic systems where the hardware is in control of DVFS we use
		hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
		For Intel specifically, we use:

		f_cur := (APERF / MPERF) * P0

		f_max := { 4C-turbo;  if available and turbo enabled
		         { 1C-turbo;  if turbo enabled
		         { P0;        otherwise

		r_dvfs := min( 1, f_cur / f_max )

		We pick 4C turbo over 1C turbo to make it slightly more sustainable.

		r_cpu is determined as the ratio of the highest performance level of the
		current CPU vs the highest performance level of any other CPU in the system.

		r_tot = r_dvfs * r_cpu

		The result is that the above 'running' and 'runnable' metrics become invariant
		of DVFS and CPU type. In other words, we can transfer and compare them
		between CPUs.

		For more detail see:

		- kernel/sched/pelt.h:update_rq_clock_pelt()
		- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
		- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
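
The Intel ratio computation described above can be sketched as follows. This
is a floating-point illustration only; the struct intel_freq_info and the
function names are hypothetical, and the kernel's own code (see the references
above) works differently in detail.

```c
/* Hypothetical container for the inputs named in the text. */
struct intel_freq_info {
	double aperf;		/* APERF counter delta */
	double mperf;		/* MPERF counter delta */
	double p0;		/* P0 frequency */
	double turbo_1c;	/* 1C-turbo frequency */
	double turbo_4c;	/* 4C-turbo frequency, 0 if not available */
	int turbo_enabled;
};

/* f_max: 4C-turbo if available and turbo enabled, else 1C-turbo, else P0. */
static double f_max(const struct intel_freq_info *fi)
{
	if (fi->turbo_enabled && fi->turbo_4c > 0.0)
		return fi->turbo_4c;
	if (fi->turbo_enabled)
		return fi->turbo_1c;
	return fi->p0;
}

/* r_dvfs := min( 1, f_cur / f_max ), with f_cur := (APERF / MPERF) * P0. */
static double r_dvfs(const struct intel_freq_info *fi)
{
	double f_cur = fi->aperf / fi->mperf * fi->p0;
	double r = f_cur / f_max(fi);

	return r < 1.0 ? r : 1.0;
}

/* r_tot = r_dvfs * r_cpu; r_cpu is supplied by the caller. */
static double r_tot(const struct intel_freq_info *fi, double r_cpu)
{
	return r_dvfs(fi) * r_cpu;
}
```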


		UTIL_EST / UTIL_EST_FASTUP
		--------------------------

		Because periodic tasks have their averages decayed while they sleep, even
		though when running their expected utilization will be the same, they suffer a
		(DVFS) ramp-up after they are running again.

		To alleviate this (a default enabled option) UTIL_EST drives an Infinite
		Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
		highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
		filter to instantly increase and only decay on decrease.

		A further runqueue wide sum (of runnable tasks) is maintained of:

		util_est := \Sum_t max( t_running, t_util_est_ewma )

		For more detail see: kernel/sched/fair.c:util_est_dequeue()
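
The runqueue-wide sum can be sketched as below. The struct task_util and the
function rq_util_est() are hypothetical names for this example; the kernel
maintains this sum incrementally on enqueue and dequeue rather than by walking
a task array.

```c
#include <stddef.h>

/* Hypothetical per-task pair of values used in the sum above. */
struct task_util {
	unsigned long running;		/* t_running */
	unsigned long util_est_ewma;	/* t_util_est_ewma */
};

/* util_est := \Sum_t max( t_running, t_util_est_ewma ) */
static unsigned long rq_util_est(const struct task_util *tasks, size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long r = tasks[i].running;
		unsigned long e = tasks[i].util_est_ewma;

		sum += r > e ? r : e;
	}
	return sum;
}
```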


		UCLAMP
		------

		It is possible to set effective u_min and u_max clamps on each CFS or RT task;
		the runqueue keeps a max aggregate of these clamps for all running tasks.

		For more detail see: include/uapi/linux/sched/types.h
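
A max aggregate over per-task clamps can be sketched as follows. The types and
the function rq_max_aggregate() are hypothetical, and returning 0 for both
clamps when there are no tasks is an assumption of this example rather than a
statement about the kernel's behaviour.

```c
#include <stddef.h>

/* Hypothetical per-task clamp pair. */
struct task_clamp {
	unsigned int u_min;
	unsigned int u_max;
};

/* Hypothetical runqueue-level aggregate. */
struct rq_clamp {
	unsigned int u_min;
	unsigned int u_max;
};

/* Take the maximum of each clamp over all tasks passed in. */
static struct rq_clamp rq_max_aggregate(const struct task_clamp *tasks,
					size_t nr)
{
	struct rq_clamp agg = { 0, 0 };
	size_t i;

	for (i = 0; i < nr; i++) {
		if (tasks[i].u_min > agg.u_min)
			agg.u_min = tasks[i].u_min;
		if (tasks[i].u_max > agg.u_max)
			agg.u_max = tasks[i].u_max;
	}
	return agg;
}
```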


		Schedutil / DVFS
		----------------

		Every time the scheduler load tracking is updated (task wakeup, task
		migration, time progression) we call out to schedutil to update the hardware
		DVFS state.

		The basis is the CPU runqueue's 'running' metric, which per the above is
		the frequency invariant utilization estimate of the CPU. From this we compute
		a desired frequency like:

		u_cfs := { max( running, util_est );  if UTIL_EST
		         { running;                   otherwise

		u_clamp := { clamp( u_cfs + u_rt, u_min, u_max );  if UCLAMP_TASK
		           { u_cfs + u_rt;                         otherwise

		u := u_clamp + u_irq + u_dl; [approx. see source for more detail]

		f_des := min( f_max, 1.25 u * f_max )

		XXX IO-wait; when the update is due to a task wakeup from IO-completion we
		boost 'u' above.

		This frequency is then used to select a P-state/OPP or directly munged into a
		CPPC style request to the hardware.

		XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
		required to satisfy the workload.

		Because these callbacks come directly from the scheduler, the DVFS hardware
		interaction should be 'fast' and non-blocking. Schedutil supports
		rate-limiting DVFS requests for when hardware interaction is slow and
		expensive; this reduces effectiveness.

		For more information see: kernel/sched/cpufreq_schedutil.c
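
The desired-frequency computation above can be sketched in floating point as
follows. The struct util_inputs, its field names and desired_freq() are
hypothetical; utilization values are fractions of capacity in [0, 1], and the
IO-wait boost and other details marked "approx." above are left out.

```c
/* Hypothetical bundle of the terms used in the formulas above. */
struct util_inputs {
	double running, util_est;	/* CFS terms */
	double u_rt, u_irq, u_dl;	/* RT, IRQ and deadline terms */
	double u_min, u_max;		/* aggregated clamps */
	int have_util_est;		/* stands in for UTIL_EST */
	int have_uclamp_task;		/* stands in for UCLAMP_TASK */
};

static double clamp_d(double v, double lo, double hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* f_des := min( f_max, 1.25 u * f_max ) */
static double desired_freq(const struct util_inputs *in, double f_max)
{
	double u_cfs = in->running;
	double u_clamp, u, f_des;

	if (in->have_util_est && in->util_est > u_cfs)
		u_cfs = in->util_est;

	u_clamp = u_cfs + in->u_rt;
	if (in->have_uclamp_task)
		u_clamp = clamp_d(u_clamp, in->u_min, in->u_max);

	u = u_clamp + in->u_irq + in->u_dl;

	f_des = 1.25 * u * f_max;
	return f_des < f_max ? f_des : f_max;
}
```

With running = 0.2, util_est = 0.4 and f_max = 2000 (all other terms zero, no
clamping), u is 0.4 and the desired frequency is 1.25 * 0.4 * 2000 = 1000. The
factor 1.25 means f_max is already requested once u reaches 0.8.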


		NOTES
		-----

		- In low-load scenarios, where DVFS is most relevant, the 'running' numbers
		will closely reflect utilization.

		- In saturated scenarios task movement will cause some transient dips.
		Suppose we have a CPU saturated with 4 tasks; when we migrate a task
		to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
		new CPU will gain 0.25. This is inevitable and time progression will
		correct this. XXX do we still guarantee f_max due to no idle-time?

		- Much of the above is about avoiding DVFS dips, and independent DVFS domains
		having to re-learn / ramp-up when load shifts.

arch/Kconfig

+9 −0

		@@ -1058,6 +1058,15 @@ config HAVE_STATIC_CALL_INLINE
		bool
		depends on HAVE_STATIC_CALL

		config HAVE_PREEMPT_DYNAMIC
		bool
		depends on HAVE_STATIC_CALL
		depends on GENERIC_ENTRY
		help
		Select this if the architecture supports boot time preempt setting
		on top of static calls. It is strongly advised to support inline
		static call to avoid any overhead.

		config ARCH_WANT_LD_ORPHAN_WARN
		bool
		help

arch/powerpc/platforms/cell/spufs/sched.c

+1 −1

		@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
		#define DEF_SPU_TIMESLICE (100 * HZ / (1000 * SPUSCHED_TICK))

		#define SCALE_PRIO(x, prio) \
		-	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
		+	max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)

		/*
		* scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:

arch/x86/Kconfig

+1 −0

		@@ -224,6 +224,7 @@ config X86
		select HAVE_STACK_VALIDATION if X86_64
		select HAVE_STATIC_CALL
		select HAVE_STATIC_CALL_INLINE if HAVE_STACK_VALIDATION
		select HAVE_PREEMPT_DYNAMIC
		select HAVE_RSEQ
		select HAVE_SYSCALL_TRACEPOINTS
		select HAVE_UNSTABLE_SCHED_CLOCK