!169 sched/fair: Scan cluster before scanning LLC in wake-up path
Merge Pull Request from: @liujie-248683921
This is the follow-up work to support the cluster scheduler. Previously we
added a cluster level to the scheduler for both ARM64 [1] and x86 [2] to
support load balancing between clusters, which brings more memory bandwidth
and reduces cache contention. This patchset, on the other hand, takes care
of the wake-up path by trying the CPUs within the same cluster before
scanning the whole LLC, to benefit tasks that communicate with each other.
[1] bd0f49e67873 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 9d6b58524779 ("sched: Add cluster scheduler level for x86")
Barry Song (2):
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
In this PR, we also introduce a set of patches that make cluster scheduling
configurable and open the kernel configuration for cluster.
Tim Chen (3):
scheduler: Create SDTL_SKIP flag to skip topology level
scheduler: Add runtime knob sysctl_sched_cluster
scheduler: Add boot time enabling/disabling of cluster scheduling
Yicong Yang (1):
scheduler: Disable cluster scheduling by default
Jie Liu (1):
sched: Open the kernel configuration for cluster
The benchmark tests were performed on a Kunpeng 920 with 96 CPUs and 4 NUMA
nodes. The baseline is the kernel with sched_cluster disabled; it is compared
against the same kernel with sched_cluster enabled. In each result row below,
the first column is the thread count, followed by the baseline score and the
patched score, with the relative change in parentheses.
The tbench test was run with 1, 3, 6, 12, 24, 48, and 96 threads on a single
NUMA node, 2 nodes, and 4 nodes, respectively.
tbench results (node 0):
1: 239.9910(0.00%) 253.1110 ( 5.47%)
3: 720.6490(0.00%) 754.6980 ( 4.72%)
6: 1423.2633(0.00%) 1494.7767 ( 5.02%)
12: 2793.9700(0.00%) 2959.4333 ( 5.92%)
24: 4951.6967(0.00%) 4880.1233 ( -1.45%)
48: 4127.8067(0.00%) 4082.2900 ( -1.10%)
96: 3737.0067(0.00%) 3700.7133 ( -0.97%)
tbench results (node 0-1):
1: 241.8367(0.00%) 252.8930 ( 4.57%)
3: 719.8940(0.00%) 752.8933 ( 4.58%)
6: 1434.6667(0.00%) 1488.1167 ( 3.73%)
12: 2834.8233(0.00%) 2908.3833 ( 2.59%)
24: 5376.3200(0.00%) 5753.2133 ( 7.01%)
48: 9709.5933(0.00%) 9610.8667 ( -1.02%)
96: 8208.3500(0.00%) 8079.8200 ( -1.57%)
tbench results (node 0-3):
1: 248.7557(0.00%) 252.9087 ( 1.67%)
3: 730.8120(0.00%) 733.9887 ( 0.43%)
6: 1439.8500(0.00%) 1424.3133 ( -1.08%)
12: 2821.1333(0.00%) 2782.7300 ( -1.36%)
24: 5366.3633(0.00%) 5050.6467 ( -5.88%)
48: 9362.8867(0.00%) 9323.3033 ( -0.42%)
96: 12269.2900(0.00%) 15987.9000 ( 30.31%)
The netperf test was run with 1, 3, 6, 12, 24, 48, and 96 threads on a single
NUMA node, 2 nodes, and 4 nodes, respectively.
netperf results TCP_RR (node 0):
1: 54557.1533(0.00%) 57056.0833 ( 4.58%)
3: 54073.4422(0.00%) 57013.9978 ( 5.44%)
6: 53158.2733(0.00%) 56904.6200 ( 7.05%)
12: 51908.7767(0.00%) 56452.1753 ( 8.75%)
24: 45868.0304(0.00%) 45737.4529 ( -0.28%)
48: 18372.7595(0.00%) 18353.7237 ( -0.10%)
96: 8298.8618(0.00%) 8276.4761 ( -0.27%)
netperf results TCP_RR (node 0-1):
1: 54645.5400(0.00%) 57017.2833 ( 4.34%)
3: 53852.1678(0.00%) 56886.0478 ( 5.63%)
6: 54196.2400(0.00%) 56772.8533 ( 4.75%)
12: 53221.2439(0.00%) 56683.0367 ( 6.50%)
24: 51334.2392(0.00%) 55881.7862 ( 8.86%)
48: 40452.4043(0.00%) 43306.8335 ( 7.06%)
96: 19012.9919(0.00%) 19051.5740 ( 0.20%)
netperf results TCP_RR (node 0-3):
1: 55933.2733(0.00%) 57134.6267 ( 2.15%)
3: 54865.2733(0.00%) 56848.4200 ( 3.61%)
6: 54131.9867(0.00%) 56813.5367 ( 4.95%)
12: 53226.0636(0.00%) 56336.5736 ( 5.84%)
24: 51632.2987(0.00%) 55689.8818 ( 7.86%)
48: 46864.5843(0.00%) 50361.5243 ( 7.46%)
96: 41761.4341(0.00%) 42939.8937 ( 2.82%)
netperf results UDP_RR (node 0):
1: 64038.8467(0.00%) 66604.0933 ( 4.01%)
3: 64253.2456(0.00%) 66948.3744 ( 4.19%)
6: 63617.2783(0.00%) 66944.2483 ( 5.23%)
12: 61060.0514(0.00%) 66565.0756 ( 9.02%)
24: 54961.9269(0.00%) 54935.7403 ( -0.05%)
48: 21988.5656(0.00%) 21964.7232 ( -0.11%)
96: 9808.7866(0.00%) 9806.4410 ( -0.02%)
netperf results UDP_RR (node 0-1):
1: 64101.6533(0.00%) 66924.6300 ( 4.40%)
3: 64058.6289(0.00%) 67014.3878 ( 4.61%)
6: 64000.8906(0.00%) 67007.2178 ( 4.70%)
12: 62794.1842(0.00%) 66901.7875 ( 6.54%)
24: 60655.0124(0.00%) 65935.2542 ( 8.71%)
48: 46036.2765(0.00%) 27424.1071 ( -40.43%)
96: 19524.9869(0.00%) 22832.9911 ( 16.94%)
netperf results UDP_RR (node 0-3):
1: 65459.8033(0.00%) 66813.4067 ( 2.07%)
3: 64308.0756(0.00%) 66555.0544 ( 3.49%)
6: 63501.9544(0.00%) 66384.6244 ( 4.54%)
12: 62764.5350(0.00%) 66237.2536 ( 5.53%)
24: 62048.7932(0.00%) 65946.3492 ( 6.28%)
48: 57374.9211(0.00%) 61945.5453 ( 7.97%)
96: 37389.7065(0.00%) 52512.2749 ( 40.45%)
The unixbench tests were run with 6, 24, and 48 threads on a single NUMA
node, 2 nodes, and 4 nodes.
===== unixbench Dhrystone 2 using register variables =====
unixbench results (node 0):
6: 22394.8000(0.00%) 22424.7000 ( 0.13%)
24: 89510.0000(0.00%) 89514.0000 ( 0.00%)
48: 89713.0000(0.00%) 89748.1000 ( 0.04%)
unixbench results (node 0-1):
6: 22427.0000(0.00%) 22366.8000 ( -0.27%)
24: 89601.6000(0.00%) 89632.3000 ( 0.03%)
48: 179007.6000(0.00%) 178949.8000 ( -0.03%)
unixbench results (node 0-3):
6: 22403.7000(0.00%) 22419.9000 ( 0.07%)
24: 89566.7000(0.00%) 89541.2000 ( -0.03%)
48: 179065.0000(0.00%) 179055.2000 ( -0.01%)
===== unixbench Double-Precision Whetstone =====
unixbench results (node 0):
6: 4783.0000(0.00%) 4782.9000 ( -0.00%)
24: 19131.6000(0.00%) 19131.7000 ( 0.00%)
48: 38257.6000(0.00%) 38258.0000 ( 0.00%)
unixbench results (node 0-1):
6: 4782.9000(0.00%) 4782.9000 ( 0.00%)
24: 19131.6000(0.00%) 19131.8000 ( 0.00%)
48: 38263.1000(0.00%) 38263.1000 ( 0.00%)
unixbench results (node 0-3):
6: 4782.9000(0.00%) 4782.9000 ( 0.00%)
24: 19131.7000(0.00%) 19131.6000 ( -0.00%)
48: 38263.1000(0.00%) 38263.2000 ( 0.00%)
===== unixbench Execl Throughput =====
unixbench results (node 0):
6: 4013.2000(0.00%) 4209.5000 ( 4.89%)
24: 11262.1000(0.00%) 11223.5000 ( -0.34%)
48: 9748.9000(0.00%) 10940.7000 ( 12.22%)
unixbench results (node 0-1):
6: 3748.0000(0.00%) 3516.6000 ( -6.17%)
24: 10683.8000(0.00%) 9172.9000 ( -14.14%)
48: 10652.3000(0.00%) 10726.0000 ( 0.69%)
unixbench results (node 0-3):
6: 2918.5000(0.00%) 2904.0000 ( -0.50%)
24: 6647.2000(0.00%) 6730.9000 ( 1.26%)
48: 6243.6000(0.00%) 6209.5000 ( -0.55%)
===== unixbench File Copy 1024 bufsize 2000 maxblocks =====
unixbench results (node 0):
6: 3494.8000(0.00%) 3189.5000 ( -8.74%)
24: 3334.5000(0.00%) 3086.5000 ( -7.44%)
48: 2415.2000(0.00%) 2630.1000 ( 8.90%)
unixbench results (node 0-1):
6: 2357.7000(0.00%) 2693.8000 ( 14.26%)
24: 2779.9000(0.00%) 2705.6000 ( -2.67%)
48: 2409.6000(0.00%) 2367.2000 ( -1.76%)
unixbench results (node 0-3):
6: 1565.7000(0.00%) 1536.3000 ( -1.88%)
24: 1545.5000(0.00%) 1550.9000 ( 0.35%)
48: 1501.4000(0.00%) 1520.3000 ( 1.26%)
===== unixbench File Copy 256 bufsize 500 maxblocks =====
unixbench results (node 0):
6: 2355.0000(0.00%) 2129.7000 ( -9.57%)
24: 2075.1000(0.00%) 2028.6000 ( -2.24%)
48: 1719.0000(0.00%) 1717.3000 ( -0.10%)
unixbench results (node 0-1):
6: 1888.6000(0.00%) 1816.2000 ( -3.83%)
24: 1862.0000(0.00%) 1800.4000 ( -3.31%)
48: 1444.2000(0.00%) 1501.1000 ( 3.94%)
unixbench results (node 0-3):
6: 1113.8000(0.00%) 969.0000 ( -13.00%)
24: 984.4000(0.00%) 996.0000 ( 1.18%)
48: 946.0000(0.00%) 955.7000 ( 1.03%)
===== unixbench File Copy 4096 bufsize 8000 maxblocks =====
unixbench results (node 0):
6: 6048.9000(0.00%) 5567.4000 ( -7.96%)
24: 6343.4000(0.00%) 5674.2000 ( -10.55%)
48: 5040.7000(0.00%) 5241.9000 ( 3.99%)
unixbench results (node 0-1):
6: 5695.3000(0.00%) 5180.0000 ( -9.05%)
24: 5098.0000(0.00%) 4768.4000 ( -6.47%)
48: 4643.8000(0.00%) 4541.2000 ( -2.21%)
unixbench results (node 0-3):
6: 2992.6000(0.00%) 4231.4000 ( 41.40%)
24: 2926.5000(0.00%) 2853.9000 ( -2.48%)
48: 2718.4000(0.00%) 2703.7000 ( -0.54%)
===== unixbench Pipe Throughput =====
unixbench results (node 0):
6: 5819.5000(0.00%) 5845.3000 ( 0.44%)
24: 23273.3000(0.00%) 23314.9000 ( 0.18%)
48: 23316.0000(0.00%) 23323.7000 ( 0.03%)
unixbench results (node 0-1):
6: 5835.2000(0.00%) 5843.8000 ( 0.15%)
24: 23278.5000(0.00%) 23376.6000 ( 0.42%)
48: 46502.1000(0.00%) 46638.4000 ( 0.29%)
unixbench results (node 0-3):
6: 5827.9000(0.00%) 5843.2000 ( 0.26%)
24: 23304.2000(0.00%) 23328.7000 ( 0.11%)
48: 46608.1000(0.00%) 46665.3000 ( 0.12%)
===== unixbench Pipe-based Context Switching =====
unixbench results (node 0):
6: 2330.2000(0.00%) 2589.9000 ( 11.14%)
24: 10905.0000(0.00%) 10840.2000 ( -0.59%)
48: 8473.8000(0.00%) 8459.3000 ( -0.17%)
unixbench results (node 0-1):
6: 2424.4000(0.00%) 2574.2000 ( 6.18%)
24: 8457.5000(0.00%) 10015.3000 ( 18.42%)
48: 19092.4000(0.00%) 17770.4000 ( -6.92%)
unixbench results (node 0-3):
6: 2365.6000(0.00%) 2585.7000 ( 9.30%)
24: 9125.8000(0.00%) 10219.2000 ( 11.98%)
48: 10861.7000(0.00%) 10656.3000 ( -1.89%)
===== unixbench Process Creation =====
unixbench results (node 0):
6: 2541.7000(0.00%) 2642.1000 ( 3.95%)
24: 6289.2000(0.00%) 6303.7000 ( 0.23%)
48: 6726.1000(0.00%) 6618.9000 ( -1.59%)
unixbench results (node 0-1):
6: 2252.1000(0.00%) 2196.6000 ( -2.46%)
24: 5883.7000(0.00%) 5915.0000 ( 0.53%)
48: 7071.9000(0.00%) 7076.5000 ( 0.07%)
unixbench results (node 0-3):
6: 1684.1000(0.00%) 1769.6000 ( 5.08%)
24: 4107.7000(0.00%) 4123.8000 ( 0.39%)
48: 4453.4000(0.00%) 4371.0000 ( -1.85%)
===== unixbench Shell Scripts (1 concurrent) =====
unixbench results (node 0):
6: 8748.0000(0.00%) 8686.8000 ( -0.70%)
24: 20378.0000(0.00%) 20350.6000 ( -0.13%)
48: 20197.5000(0.00%) 20047.7000 ( -0.74%)
unixbench results (node 0-1):
6: 8265.6000(0.00%) 8115.8000 ( -1.81%)
24: 25387.6000(0.00%) 25443.6000 ( 0.22%)
48: 32417.7000(0.00%) 31579.4000 ( -2.59%)
unixbench results (node 0-3):
6: 6963.4000(0.00%) 6963.7000 ( 0.00%)
24: 20347.2000(0.00%) 20397.7000 ( 0.25%)
48: 23783.0000(0.00%) 23854.7000 ( 0.30%)
===== unixbench Shell Scripts (8 concurrent) =====
unixbench results (node 0):
6: 19852.4000(0.00%) 19829.3000 ( -0.12%)
24: 19548.3000(0.00%) 19434.7000 ( -0.58%)
48: 19321.0000(0.00%) 19366.3000 ( 0.23%)
unixbench results (node 0-1):
6: 24136.5000(0.00%) 23653.4000 ( -2.00%)
24: 31973.2000(0.00%) 31187.2000 ( -2.46%)
48: 31769.6000(0.00%) 30218.6000 ( -4.88%)
unixbench results (node 0-3):
6: 18668.3000(0.00%) 18696.6000 ( 0.15%)
24: 21164.6000(0.00%) 21599.2000 ( 2.05%)
48: 18580.7000(0.00%) 18964.8000 ( 2.07%)
===== unixbench System Call Overhead =====
unixbench results (node 0):
6: 2236.9000(0.00%) 2057.4000 ( -8.02%)
24: 2907.7000(0.00%) 2910.1000 ( 0.08%)
48: 2919.5000(0.00%) 2921.1000 ( 0.05%)
unixbench results (node 0-1):
6: 1106.3000(0.00%) 1016.8000 ( -8.09%)
24: 1186.4000(0.00%) 1178.5000 ( -0.67%)
48: 1215.7000(0.00%) 1212.5000 ( -0.26%)
unixbench results (node 0-3):
6: 1363.0000(0.00%) 1082.2000 ( -20.60%)
24: 1569.9000(0.00%) 1457.7000 ( -7.15%)
48: 1487.5000(0.00%) 1456.4000 ( -2.09%)
The fio read test was performed with bs=64k, iodepth=128, and numjobs of 32, 64, and 128.
BW (MiB/s):
32 1077(0.00%) 1130(4.92%)
64 1077(0.00%) 1077(0.00%)
128 1080(0.00%) 1080.6(0.05%)
IOPS (k):
32 17.2(0.00%) 17.2(0.00%)
64 17.2(0.00%) 17.2(0.00%)
128 17.3(0.00%) 17.3(0.00%)
The lmbench test was performed with 32 threads; the memory frequency is
2599 MHz.
Processor, Processes - times in microseconds - smaller is better
null call 0.18(0.00%) 0.18( 0.00%)
null I/O 4.785(0.00%) 4.35( 9.09%)
stat 53.55(0.00%) 52.05( 2.80%)
open clos 104.5(0.00%) 105.5(-0.96%)
slct TCP 2.6425(0.00%) 2.645(-0.09%)
sig inst 0.2525(0.00%) 0.25( 0.99%)
sig hndl 2.285(0.00%) 2.2825( 0.11%)
fork proc 1359.75(0.00%) 1380(-1.49%)
exec proc 3625.25(0.00%) 3599.25( 0.72%)
sh proc 5292.5(0.00%) 5288.75( 0.07%)
Context switching - times in microseconds - smaller is better
2p/0K 2.7775(0.00%) 2.1575(22.32%)
2p/16K 2.9825(0.00%) 2.67(10.48%)
2p/64K 3.32(0.00%) 3.005( 9.49%)
8p/16K 8.2275(0.00%) 4.4025(46.49%)
8p/64K 10.1175(0.00%) 5.8775(41.91%)
16p/16K 8.92(0.00%) 5.8275(34.67%)
16p/64K 12.275(0.00%) 9.075(26.07%)
*Local* Communication latencies in microseconds - smaller is better
2p/0K 2.7775(0.00%) 2.1575(22.32%)
Pipe 6.98525(0.00%) 6.00875(13.98%)
AF UNIX 7.6475(0.00%) 6.955( 9.06%)
RPC/UDP 277.675(0.00%) 275.3( 0.86%)
TCP 19.825(0.00%) 18.575( 6.31%)
RPC/TCP 330.125(0.00%) 293.35(11.14%)
File & VM system latencies in microseconds - smaller is better
Mmap Latency 569.125(0.00%) 539.7( 5.17%)
Prot Fault 0.529667(0.00%) 0.38525(27.27%)
Page Fault 0.611225(0.00%) 0.6444(-5.43%)
100fd selct 1.16175(0.00%) 1.16275(-0.09%)
*Local* Communication bandwidths in MB/s - bigger is better
Pipe 71.25(0.00%) 76.5( 7.37%)
AF UNIX 78.75(0.00%) 79.25( 0.63%)
TCP 67.25(0.00%) 70.75( 5.20%)
File reread 87.15(0.00%) 96.625(10.87%)
Map reread 110.9(0.00%) 114.875( 3.58%)
Bcopy libc 49.75(0.00%) 53.525( 7.59%)
Bcopy hand 52.9(0.00%) 52.175(-1.37%)
Mem read 102.75(0.00%) 118.75(15.57%)
Mem write 44.175(0.00%) 51.25(16.02%)
Link: https://gitee.com/openeuler/kernel/pulls/169
Reviewed-by: Liu Chao <liuchao173@huawei.com>
Reviewed-by: Fred Kimmy <xweikong@163.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>