sched: Describe CFS load-balancer (e9c84cb8) · Commits · EulixOS / Software / Kernel

kernel/sched/fair.c

+116 −2

		@@ -3456,7 +3456,121 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp

		#ifdef CONFIG_SMP
		/**************************************************
		* Fair scheduling class load-balancing methods:
		* Fair scheduling class load-balancing methods.
		*
		* BASICS
		*
		* The purpose of load-balancing is to achieve the same basic fairness the
		* per-cpu scheduler provides, namely provide a proportional amount of compute
		* time to each task. This is expressed in the following equation:
		*
		* W_i,n/P_i == W_j,n/P_j for all i,j (1)
		*
		* Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
		* W_i,0 is defined as:
		*
		* W_i,0 = \Sum_j w_i,j (2)
		*
		* Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
		* is derived from the nice value as per prio_to_weight[].
		*
		* The weight average is an exponential decay average of the instantaneous
		* weight:
		*
		* W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0 (3)
		*
		* P_i is the cpu power (or compute capacity) of cpu i; typically it is the
		* fraction of 'recent' time available for SCHED_OTHER task execution, but it
		* can also include other factors [XXX].
		*
		* To achieve this balance we define a measure of imbalance which follows
		* directly from (1):
		*
		* imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j } (4)
		*
		* We then move tasks around to minimize the imbalance. In the continuous
		* function space it is obvious this converges; in the discrete case we get
		* a few fun cases generally called infeasible weight scenarios.
		*
		* [XXX expand on:
		* - infeasible weights;
		* - local vs global optima in the discrete case. ]
		*
		*
		* SCHED DOMAINS
		*
		* In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
		* for all i,j solution, we create a tree of cpus that follows the hardware
		* topology where each level pairs two lower groups (or better). This results
		* in O(log n) layers. Furthermore we reduce the number of cpus going up the
		* tree to only the first of the previous level and we decrease the frequency
		* of load-balance at each level inversely proportional to the number of cpus
		* in the groups.
		*
		* This yields:
		*
		*  log_2 n     1     n
		*   \Sum    { --- * --- * 2^i } = O(n)                              (5)
		*   i = 0     2^i   2^i
		*                          `- size of each group
		*    |         |     `- number of cpus doing load-balance
		*    |         `- freq
		*    `- sum over all levels
		*
		* Coupled with a limit on how many tasks we can migrate every balance pass,
		* this makes (5) the runtime complexity of the balancer.
		*
		* An important property here is that each CPU is still (indirectly) connected
		* to every other cpu in at most O(log n) steps:
		*
		* The adjacency matrix of the resulting graph is given by:
		*
		*           log_2 n
		*   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
		*            k = 0
		*
		* And you'll find that:
		*
		* A^(log_2 n)_i,j != 0 for all i,j (7)
		*
		* Showing there's indeed a path between every cpu in at most O(log n) steps.
		* The task movement gives a factor of O(m), giving a convergence complexity
		* of:
		*
		* O(nm log n), n := nr_cpus, m := nr_tasks (8)
		*
		*
		* WORK CONSERVING
		*
		* In order to avoid CPUs going idle while there's still work to do, new idle
		* balancing is more aggressive and has the newly idle cpu iterate up the domain
		* tree itself instead of relying on other CPUs to bring it work.
		*
		* This adds some complexity to both (5) and (8) but it reduces the total idle
		* time.
		*
		* [XXX more?]
		*
		*
		* CGROUPS
		*
		* Cgroups make a horror show out of (2), instead of a simple sum we get:
		*
		*                                s_k,i
		*   W_i,0 = \Sum_j \Prod_k w_k * -----                              (9)
		*                                 S_k
		*
		* Where
		*
		* s_k,i = \Sum_j w_i,j,k and S_k = \Sum_i s_k,i (10)
		*
		* w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
		*
		* The big problem is S_k: it's a global sum needed to compute a local (W_i)
		* property.
		*
		* [XXX write more on how we solve this.. _after_ merging pjt's patches that
		* rewrite all of this once again.]
		*/
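Equations (2) to (4) in the comment above can be made concrete with a small user-space sketch. This is not code from kernel/sched/fair.c: the helper names `instantaneous_weight`, `decay_step` and `imbalance` are hypothetical, and floating point is used only for readability (the kernel itself works in fixed-point integers).

```c
#include <assert.h>

/* Hypothetical helpers for illustration; none of these exist in fair.c. */

/* Equation (2): W_i,0 is the sum of the weights of the runnable tasks. */
static double instantaneous_weight(const double *w, int nr_tasks)
{
	double sum = 0.0;
	int j;

	assert(nr_tasks >= 0);
	for (j = 0; j < nr_tasks; j++)
		sum += w[j];
	return sum;
}

/*
 * Equation (3): one update of the n-th weight average,
 * W' = (2^n - 1)/2^n * W_avg + 1/2^n * W_inst.
 */
static double decay_step(double w_avg, double w_inst, int n)
{
	double scale;

	assert(n >= 0 && n < 31);
	scale = (double)(1UL << n);
	return (scale - 1.0) / scale * w_avg + w_inst / scale;
}

/*
 * Equation (4): imbalance between cpu i and cpu j, where load_i and
 * load_j are the W/P ratios of the two cpus and avg is avg(W/P).
 */
static double imbalance(double avg, double load_i, double load_j)
{
	double hi = load_i > avg ? load_i : avg;
	double lo = load_j < avg ? load_j : avg;

	return hi - lo;
}
```

With an average ratio of 2.0, a cpu at 3.0 and a cpu at 1.0 give an imbalance of 2.0, while swapping the two arguments gives 0: the measure only reports work that could flow from the busier cpu i to the idler cpu j.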

		static unsigned long __read_mostly max_load_balance_interval = HZ/10;
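The counting arguments in the SCHED DOMAINS part of the comment can be checked numerically. The sketch below is a user-space illustration only, assuming n is a power of two and an idealized binary domain tree. `balance_cost`, `adjacent`, `fully_reachable` and `MAX_CPUS` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CPUS 64	/* arbitrary bound for this sketch */

static unsigned int ilog2_u(unsigned int n)
{
	unsigned int l = 0;

	while (n >>= 1)
		l++;
	return l;
}

/* Equation (5): sum over levels of freq * balancing cpus * group size. */
static double balance_cost(unsigned int n)
{
	double sum = 0.0;
	unsigned int i;

	assert(n && !(n & (n - 1)));
	for (i = 0; i <= ilog2_u(n); i++) {
		double group = (double)(1u << i);

		sum += (1.0 / group) * ((double)n / group) * group;
	}
	return sum;	/* equals 2n - 1, i.e. O(n) */
}

/* Equation (6): is there an edge from cpu i to cpu j at some level k? */
static bool adjacent(unsigned int i, unsigned int j, unsigned int levels)
{
	unsigned int k;

	for (k = 0; k <= levels; k++)
		if (i % (1u << k) == 0 && (i >> (k + 1)) == (j >> (k + 1)))
			return true;
	return false;
}

/* Equation (7): every entry of the boolean power A^(log_2 n) is set. */
static bool fully_reachable(unsigned int n)
{
	static bool a[MAX_CPUS][MAX_CPUS];
	static bool reach[MAX_CPUS][MAX_CPUS];
	static bool next[MAX_CPUS][MAX_CPUS];
	unsigned int levels = ilog2_u(n);
	unsigned int i, j, k, step;

	assert(n && n <= MAX_CPUS && !(n & (n - 1)));
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			a[i][j] = reach[i][j] = adjacent(i, j, levels);

	/* reach starts as A^1; each pass multiplies by A once more. */
	for (step = 1; step < levels; step++) {
		for (i = 0; i < n; i++)
			for (j = 0; j < n; j++) {
				next[i][j] = false;
				for (k = 0; k < n; k++)
					if (reach[i][k] && a[k][j])
						next[i][j] = true;
			}
		memcpy(reach, next, sizeof(reach));
	}

	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			if (!reach[i][j])
				return false;
	return true;
}
```

For n = 8 the sum in (5) is 8 + 4 + 2 + 1 = 15, and cpu 7 reaches every other cpu via 7 -> 6 -> 4 in at most three steps, which is what `fully_reachable(8)` confirms.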