!169 sched/fair: Scan cluster before scanning LLC in wake-up path
Merge Pull Request from: @liujie-248683921
This is the follow-up work to support the cluster scheduler. Previously we
added a cluster level to the scheduler for both ARM64 [1] and x86 [2] to
support load balancing between clusters, which brings more memory bandwidth
and reduces cache contention. This patchset, on the other hand, takes care
of the wake-up path by trying the CPUs within the same cluster before
scanning the whole LLC, to benefit tasks that communicate with each other.
[1] bd0f49e67873 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 9d6b58524779 ("sched: Add cluster scheduler level for x86")
Barry Song (2):
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
In this PR, we also introduce a set of patches that make cluster scheduling
configurable and open the kernel configuration for cluster.
Tim Chen (3):
scheduler: Create SDTL_SKIP flag to skip topology level
scheduler: Add runtime knob sysctl_sched_cluster
scheduler: Add boot time enabling/disabling of cluster scheduling
Yicong Yang (1):
scheduler: Disable cluster scheduling by default
Jie Liu (1):
sched: Open the kernel configuration for cluster
The benchmark tests were performed on a Kunpeng 920 with 96 CPUs and 4 NUMA
nodes. The baseline is the kernel with sched_cluster disabled; it is compared
against the same kernel with sched_cluster enabled. In each result row below,
the first column is the thread count, followed by the baseline score and the
patched score, with the relative change in parentheses.
The tbench test was run with 1, 3, 6, 12, 24, 48, and 96 threads on a single
NUMA node, 2 nodes, and 4 nodes, respectively.
tbench results (node 0):
1: 239.9910(0.00%) 253.1110 ( 5.47%)
3: 720.6490(0.00%) 754.6980 ( 4.72%)
6: 1423.2633(0.00%) 1494.7767 ( 5.02%)
12: 2793.9700(0.00%) 2959.4333 ( 5.92%)
24: 4951.6967(0.00%) 4880.1233 ( -1.45%)
48: 4127.8067(0.00%) 4082.2900 ( -1.10%)
96: 3737.0067(0.00%) 3700.7133 ( -0.97%)
tbench results (node 0-1):
1: 241.8367(0.00%) 252.8930 ( 4.57%)
3: 719.8940(0.00%) 752.8933 ( 4.58%)
6: 1434.6667(0.00%) 1488.1167 ( 3.73%)
12: 2834.8233(0.00%) 2908.3833 ( 2.59%)
24: 5376.3200(0.00%) 5753.2133 ( 7.01%)
48: 9709.5933(0.00%) 9610.8667 ( -1.02%)
96: 8208.3500(0.00%) 8079.8200 ( -1.57%)
tbench results (node 0-3):
1: 248.7557(0.00%) 252.9087 ( 1.67%)
3: 730.8120(0.00%) 733.9887 ( 0.43%)
6: 1439.8500(0.00%) 1424.3133 ( -1.08%)
12: 2821.1333(0.00%) 2782.7300 ( -1.36%)
24: 5366.3633(0.00%) 5050.6467 ( -5.88%)
48: 9362.8867(0.00%) 9323.3033 ( -0.42%)
96: 12269.2900(0.00%) 15987.9000 ( 30.31%)
The netperf test was run with 1, 3, 6, 12, 24, 48, and 96 threads on a single
NUMA node, 2 nodes, and 4 nodes, respectively.
netperf results TCP_RR (node 0):
1: 54557.1533(0.00%) 57056.0833 ( 4.58%)
3: 54073.4422(0.00%) 57013.9978 ( 5.44%)
6: 53158.2733(0.00%) 56904.6200 ( 7.05%)
12: 51908.7767(0.00%) 56452.1753 ( 8.75%)
24: 45868.0304(0.00%) 45737.4529 ( -0.28%)
48: 18372.7595(0.00%) 18353.7237 ( -0.10%)
96: 8298.8618(0.00%) 8276.4761 ( -0.27%)
netperf results TCP_RR (node 0-1):
1: 54645.5400(0.00%) 57017.2833 ( 4.34%)
3: 53852.1678(0.00%) 56886.0478 ( 5.63%)
6: 54196.2400(0.00%) 56772.8533 ( 4.75%)
12: 53221.2439(0.00%) 56683.0367 ( 6.50%)
24: 51334.2392(0.00%) 55881.7862 ( 8.86%)
48: 40452.4043(0.00%) 43306.8335 ( 7.06%)
96: 19012.9919(0.00%) 19051.5740 ( 0.20%)
netperf results TCP_RR (node 0-3):
1: 55933.2733(0.00%) 57134.6267 ( 2.15%)
3: 54865.2733(0.00%) 56848.4200 ( 3.61%)
6: 54131.9867(0.00%) 56813.5367 ( 4.95%)
12: 53226.0636(0.00%) 56336.5736 ( 5.84%)
24: 51632.2987(0.00%) 55689.8818 ( 7.86%)
48: 46864.5843(0.00%) 50361.5243 ( 7.46%)
96: 41761.4341(0.00%) 42939.8937 ( 2.82%)
netperf results UDP_RR (node 0):
1: 64038.8467(0.00%) 66604.0933 ( 4.01%)
3: 64253.2456(0.00%) 66948.3744 ( 4.19%)
6: 63617.2783(0.00%) 66944.2483 ( 5.23%)
12: 61060.0514(0.00%) 66565.0756 ( 9.02%)
24: 54961.9269(0.00%) 54935.7403 ( -0.05%)
48: 21988.5656(0.00%) 21964.7232 ( -0.11%)
96: 9808.7866(0.00%) 9806.4410 ( -0.02%)
netperf results UDP_RR (node 0-1):
1: 64101.6533(0.00%) 66924.6300 ( 4.40%)
3: 64058.6289(0.00%) 67014.3878 ( 4.61%)
6: 64000.8906(0.00%) 67007.2178 ( 4.70%)
12: 62794.1842(0.00%) 66901.7875 ( 6.54%)
24: 60655.0124(0.00%) 65935.2542 ( 8.71%)
48: 46036.2765(0.00%) 27424.1071 ( -40.43%)
96: 19524.9869(0.00%) 22832.9911 ( 16.94%)
netperf results UDP_RR (node 0-3):
1: 65459.8033(0.00%) 66813.4067 ( 2.07%)
3: 64308.0756(0.00%) 66555.0544 ( 3.49%)
6: 63501.9544(0.00%) 66384.6244 ( 4.54%)
12: 62764.5350(0.00%) 66237.2536 ( 5.53%)
24: 62048.7932(0.00%) 65946.3492 ( 6.28%)
48: 57374.9211(0.00%) 61945.5453 ( 7.97%)
96: 37389.7065(0.00%) 52512.2749 ( 40.45%)
The unixbench tests were run with 6, 24, and 48 threads on a single NUMA
node, 2 nodes, and 4 nodes.
===== unixbench Dhrystone 2 using register variables =====
unixbench results (node 0):
6: 22394.8000(0.00%) 22424.7000 ( 0.13%)
24: 89510.0000(0.00%) 89514.0000 ( 0.00%)
48: 89713.0000(0.00%) 89748.1000 ( 0.04%)
unixbench results (node 0-1):
6: 22427.0000(0.00%) 22366.8000 ( -0.27%)
24: 89601.6000(0.00%) 89632.3000 ( 0.03%)
48: 179007.6000(0.00%) 178949.8000 ( -0.03%)
unixbench results (node 0-3):
6: 22403.7000(0.00%) 22419.9000 ( 0.07%)
24: 89566.7000(0.00%) 89541.2000 ( -0.03%)
48: 179065.0000(0.00%) 179055.2000 ( -0.01%)
===== unixbench Double-Precision Whetstone =====
unixbench results (node 0):
6: 4783.0000(0.00%) 4782.9000 ( -0.00%)
24: 19131.6000(0.00%) 19131.7000 ( 0.00%)
48: 38257.6000(0.00%) 38258.0000 ( 0.00%)
unixbench results (node 0-1):
6: 4782.9000(0.00%) 4782.9000 ( 0.00%)
24: 19131.6000(0.00%) 19131.8000 ( 0.00%)
48: 38263.1000(0.00%) 38263.1000 ( 0.00%)
unixbench results (node 0-3):
6: 4782.9000(0.00%) 4782.9000 ( 0.00%)
24: 19131.7000(0.00%) 19131.6000 ( -0.00%)
48: 38263.1000(0.00%) 38263.2000 ( 0.00%)
===== unixbench Execl Throughput =====
unixbench results (node 0):
6: 4013.2000(0.00%) 4209.5000 ( 4.89%)
24: 11262.1000(0.00%) 11223.5000 ( -0.34%)
48: 9748.9000(0.00%) 10940.7000 ( 12.22%)
unixbench results (node 0-1):
6: 3748.0000(0.00%) 3516.6000 ( -6.17%)
24: 10683.8000(0.00%) 9172.9000 ( -14.14%)
48: 10652.3000(0.00%) 10726.0000 ( 0.69%)
unixbench results (node 0-3):
6: 2918.5000(0.00%) 2904.0000 ( -0.50%)
24: 6647.2000(0.00%) 6730.9000 ( 1.26%)
48: 6243.6000(0.00%) 6209.5000 ( -0.55%)
===== unixbench File Copy 1024 bufsize 2000 maxblocks =====
unixbench results (node 0):
6: 3494.8000(0.00%) 3189.5000 ( -8.74%)
24: 3334.5000(0.00%) 3086.5000 ( -7.44%)
48: 2415.2000(0.00%) 2630.1000 ( 8.90%)
unixbench results (node 0-1):
6: 2357.7000(0.00%) 2693.8000 ( 14.26%)
24: 2779.9000(0.00%) 2705.6000 ( -2.67%)
48: 2409.6000(0.00%) 2367.2000 ( -1.76%)
unixbench results (node 0-3):
6: 1565.7000(0.00%) 1536.3000 ( -1.88%)
24: 1545.5000(0.00%) 1550.9000 ( 0.35%)
48: 1501.4000(0.00%) 1520.3000 ( 1.26%)
===== unixbench File Copy 256 bufsize 500 maxblocks =====
unixbench results (node 0):
6: 2355.0000(0.00%) 2129.7000 ( -9.57%)
24: 2075.1000(0.00%) 2028.6000 ( -2.24%)
48: 1719.0000(0.00%) 1717.3000 ( -0.10%)
unixbench results (node 0-1):
6: 1888.6000(0.00%) 1816.2000 ( -3.83%)
24: 1862.0000(0.00%) 1800.4000 ( -3.31%)
48: 1444.2000(0.00%) 1501.1000 ( 3.94%)
unixbench results (node 0-3):
6: 1113.8000(0.00%) 969.0000 ( -13.00%)
24: 984.4000(0.00%) 996.0000 ( 1.18%)
48: 946.0000(0.00%) 955.7000 ( 1.03%)
===== unixbench File Copy 4096 bufsize 8000 maxblocks =====
unixbench results (node 0):
6: 6048.9000(0.00%) 5567.4000 ( -7.96%)
24: 6343.4000(0.00%) 5674.2000 ( -10.55%)
48: 5040.7000(0.00%) 5241.9000 ( 3.99%)
unixbench results (node 0-1):
6: 5695.3000(0.00%) 5180.0000 ( -9.05%)
24: 5098.0000(0.00%) 4768.4000 ( -6.47%)
48: 4643.8000(0.00%) 4541.2000 ( -2.21%)
unixbench results (node 0-3):
6: 2992.6000(0.00%) 4231.4000 ( 41.40%)
24: 2926.5000(0.00%) 2853.9000 ( -2.48%)
48: 2718.4000(0.00%) 2703.7000 ( -0.54%)
===== unixbench Pipe Throughput =====
unixbench results (node 0):
6: 5819.5000(0.00%) 5845.3000 ( 0.44%)
24: 23273.3000(0.00%) 23314.9000 ( 0.18%)
48: 23316.0000(0.00%) 23323.7000 ( 0.03%)
unixbench results (node 0-1):
6: 5835.2000(0.00%) 5843.8000 ( 0.15%)
24: 23278.5000(0.00%) 23376.6000 ( 0.42%)
48: 46502.1000(0.00%) 46638.4000 ( 0.29%)
unixbench results (node 0-3):
6: 5827.9000(0.00%) 5843.2000 ( 0.26%)
24: 23304.2000(0.00%) 23328.7000 ( 0.11%)
48: 46608.1000(0.00%) 46665.3000 ( 0.12%)
===== unixbench Pipe-based Context Switching =====
unixbench results (node 0):
6: 2330.2000(0.00%) 2589.9000 ( 11.14%)
24: 10905.0000(0.00%) 10840.2000 ( -0.59%)
48: 8473.8000(0.00%) 8459.3000 ( -0.17%)
unixbench results (node 0-1):
6: 2424.4000(0.00%) 2574.2000 ( 6.18%)
24: 8457.5000(0.00%) 10015.3000 ( 18.42%)
48: 19092.4000(0.00%) 17770.4000 ( -6.92%)
unixbench results (node 0-3):
6: 2365.6000(0.00%) 2585.7000 ( 9.30%)
24: 9125.8000(0.00%) 10219.2000 ( 11.98%)
48: 10861.7000(0.00%) 10656.3000 ( -1.89%)
===== unixbench Process Creation =====
unixbench results (node 0):
6: 2541.7000(0.00%) 2642.1000 ( 3.95%)
24: 6289.2000(0.00%) 6303.7000 ( 0.23%)
48: 6726.1000(0.00%) 6618.9000 ( -1.59%)
unixbench results (node 0-1):
6: 2252.1000(0.00%) 2196.6000 ( -2.46%)
24: 5883.7000(0.00%) 5915.0000 ( 0.53%)
48: 7071.9000(0.00%) 7076.5000 ( 0.07%)
unixbench results (node 0-3):
6: 1684.1000(0.00%) 1769.6000 ( 5.08%)
24: 4107.7000(0.00%) 4123.8000 ( 0.39%)
48: 4453.4000(0.00%) 4371.0000 ( -1.85%)
===== unixbench Shell Scripts (1 concurrent) =====
unixbench results (node 0):
6: 8748.0000(0.00%) 8686.8000 ( -0.70%)
24: 20378.0000(0.00%) 20350.6000 ( -0.13%)
48: 20197.5000(0.00%) 20047.7000 ( -0.74%)
unixbench results (node 0-1):
6: 8265.6000(0.00%) 8115.8000 ( -1.81%)
24: 25387.6000(0.00%) 25443.6000 ( 0.22%)
48: 32417.7000(0.00%) 31579.4000 ( -2.59%)
unixbench results (node 0-3):
6: 6963.4000(0.00%) 6963.7000 ( 0.00%)
24: 20347.2000(0.00%) 20397.7000 ( 0.25%)
48: 23783.0000(0.00%) 23854.7000 ( 0.30%)
===== unixbench Shell Scripts (8 concurrent) =====
unixbench results (node 0):
6: 19852.4000(0.00%) 19829.3000 ( -0.12%)
24: 19548.3000(0.00%) 19434.7000 ( -0.58%)
48: 19321.0000(0.00%) 19366.3000 ( 0.23%)
unixbench results (node 0-1):
6: 24136.5000(0.00%) 23653.4000 ( -2.00%)
24: 31973.2000(0.00%) 31187.2000 ( -2.46%)
48: 31769.6000(0.00%) 30218.6000 ( -4.88%)
unixbench results (node 0-3):
6: 18668.3000(0.00%) 18696.6000 ( 0.15%)
24: 21164.6000(0.00%) 21599.2000 ( 2.05%)
48: 18580.7000(0.00%) 18964.8000 ( 2.07%)
===== unixbench System Call Overhead =====
unixbench results (node 0):
6: 2236.9000(0.00%) 2057.4000 ( -8.02%)
24: 2907.7000(0.00%) 2910.1000 ( 0.08%)
48: 2919.5000(0.00%) 2921.1000 ( 0.05%)
unixbench results (node 0-1):
6: 1106.3000(0.00%) 1016.8000 ( -8.09%)
24: 1186.4000(0.00%) 1178.5000 ( -0.67%)
48: 1215.7000(0.00%) 1212.5000 ( -0.26%)
unixbench results (node 0-3):
6: 1363.0000(0.00%) 1082.2000 ( -20.60%)
24: 1569.9000(0.00%) 1457.7000 ( -7.15%)
48: 1487.5000(0.00%) 1456.4000 ( -2.09%)
The fio read test was performed with bs=64k, iodepth=128, and numjobs of 32, 64, and 128.
BW (MiB/s):
32 1077(0.00%) 1130(4.92%)
64 1077(0.00%) 1077(0.00%)
128 1080(0.00%) 1080.6(0.05%)
IOPS (k):
32 17.2(0.00%) 17.2(0.00%)
64 17.2(0.00%) 17.2(0.00%)
128 17.3(0.00%) 17.3(0.00%)
The lmbench test was performed with 32 threads; the memory frequency is
2599 MHz.
Processor, Processes - times in microseconds - smaller is better
null call 0.18(0.00%) 0.18( 0.00%)
null I/O 4.785(0.00%) 4.35( 9.09%)
stat 53.55(0.00%) 52.05( 2.80%)
open clos 104.5(0.00%) 105.5(-0.96%)
slct TCP 2.6425(0.00%) 2.645(-0.09%)
sig inst 0.2525(0.00%) 0.25( 0.99%)
sig hndl 2.285(0.00%) 2.2825( 0.11%)
fork proc 1359.75(0.00%) 1380(-1.49%)
exec proc 3625.25(0.00%) 3599.25( 0.72%)
sh proc 5292.5(0.00%) 5288.75( 0.07%)
Context switching - times in microseconds - smaller is better
2p/0K 2.7775(0.00%) 2.1575(22.32%)
2p/16K 2.9825(0.00%) 2.67(10.48%)
2p/64K 3.32(0.00%) 3.005( 9.49%)
8p/16K 8.2275(0.00%) 4.4025(46.49%)
8p/64K 10.1175(0.00%) 5.8775(41.91%)
16p/16K 8.92(0.00%) 5.8275(34.67%)
16p/64K 12.275(0.00%) 9.075(26.07%)
*Local* Communication latencies in microseconds - smaller is better
2p/0K 2.7775(0.00%) 2.1575(22.32%)
Pipe 6.98525(0.00%) 6.00875(13.98%)
AF UNIX 7.6475(0.00%) 6.955( 9.06%)
RPC/UDP 277.675(0.00%) 275.3( 0.86%)
TCP 19.825(0.00%) 18.575( 6.31%)
RPC/TCP 330.125(0.00%) 293.35(11.14%)
File & VM system latencies in microseconds - smaller is better
Mmap Latency 569.125(0.00%) 539.7( 5.17%)
Prot Fault 0.529667(0.00%) 0.38525(27.27%)
Page Fault 0.611225(0.00%) 0.6444(-5.43%)
100fd selct 1.16175(0.00%) 1.16275(-0.09%)
*Local* Communication bandwidths in MB/s - bigger is better
Pipe 71.25(0.00%) 76.5( 7.37%)
AF UNIX 78.75(0.00%) 79.25( 0.63%)
TCP 67.25(0.00%) 70.75( 5.20%)
File reread 87.15(0.00%) 96.625(10.87%)
Map reread 110.9(0.00%) 114.875( 3.58%)
Bcopy libc 49.75(0.00%) 53.525( 7.59%)
Bcopy hand 52.9(0.00%) 52.175(-1.37%)
Mem read 102.75(0.00%) 118.75(15.57%)
Mem write 44.175(0.00%) 51.25(16.02%)
Link: https://gitee.com/openeuler/kernel/pulls/169
Reviewed-by: Liu Chao <liuchao173@huawei.com>
Reviewed-by: Fred Kimmy <xweikong@163.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>