Commit 47983efa authored Dec 13, 2024 by Zheng Zengkai

Align per_cpu osq_node to 64 Byte size cacheline

hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/IBAZIJ



--------------------------------

While tuning the file copy performance of unixbench 5.1.3,
an obvious performance gap was found after an unrelated patch
("KVM: arm64: Exclude mdcr_el2_host from kvm_vcpu_arch") was merged.
After some debugging, it was confirmed that different cacheline
alignments of the per_cpu variable osq_node lead to different
performance:

System.map of good performance case:

ffff800081a50dc0 D runqueues
ffff800081a51e80 d qos_overload_timer
ffff800081a51f00 d qos_throttled_cfs_rq
ffff800081a51f40 d osq_node    <- osq_node is 64Byte aligned
ffff800081a51fc0 d qnodes
ffff800081a52040 d rcu_data

System.map of bad performance case:

ffff800081a51000 D runqueues
ffff800081a520c0 d qos_overload_timer
ffff800081a62140 d qos_throttled_cfs_rq
ffff800081a62180 d osq_node    <- osq_node is 128Byte aligned
ffff800081a62200 d qnodes
ffff800081a62280 d rcu_data
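
The alignment of the two osq_node addresses listed above can be checked
by masking their low bits. A minimal userspace sketch (the helper names
offset_in_line and is_aligned are illustrative only, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of addr inside a line of line_size bytes (must be a power of two). */
static inline uint64_t offset_in_line(uint64_t addr, uint64_t line_size)
{
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	return addr & (line_size - 1);
}

/* True if addr sits exactly on a line_size boundary. */
static inline bool is_aligned(uint64_t addr, uint64_t line_size)
{
	return offset_in_line(addr, line_size) == 0;
}
```

For the good case, offset_in_line(0xffff800081a51f40, 128) is 64, so
osq_node is 64-byte aligned but not 128-byte aligned. For the bad case,
0xffff800081a62180 is a multiple of 128.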

Align the preceding per_cpu variable qos_throttled_cfs_rq to a
128-byte boundary so that osq_node, which follows it, is 64-byte
cacheline aligned, achieving a better performance score in the
file copy testcase.
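
Why a 128-byte alignment on the preceding variable pins osq_node to an
odd multiple of 64 can be modelled in userspace with a struct. This is
only a sketch: it assumes the two variables stay adjacent and that the
shared-aligned section uses 64-byte alignment (as the System.map spacing
suggests). The real layout is decided by the linker, and the *_model
types are hypothetical stand-ins for the kernel structures.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structs may differ in size. */
struct list_head_model { void *next, *prev; };
struct osq_node_model  { void *next, *prev; int locked; int cpu; };

/* Models two adjacent variables in the shared-aligned per-CPU section. */
struct percpu_model {
	alignas(128) struct list_head_model qos_throttled_cfs_rq;
	alignas(64)  struct osq_node_model  osq_node;
};

/* osq_node lands on a 64-byte boundary ... */
static_assert(offsetof(struct percpu_model, osq_node) % 64 == 0,
	      "osq_node is 64-byte aligned");
/* ... but exactly 64 bytes past a 128-byte boundary. */
static_assert(offsetof(struct percpu_model, osq_node) % 128 == 64,
	      "osq_node is not 128-byte aligned");
```

The second assertion holds only while the preceding variable is no
larger than 64 bytes, which is why the workaround depends on the
relative position of the two variables.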

Before this patch:
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0 9014327065.8 772435.9
Double-Precision Whetstone                       55.0    1773200.2 322400.0
Execl Throughput                                 43.0      25330.3   5890.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     500211.0   1263.2
File Copy 256 bufsize 500 maxblocks            1655.0     135793.0    820.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    2033821.0   3506.6
Pipe Throughput                               12440.0  307115565.6 246877.5
Pipe-based Context Switching                   4000.0   26449665.0  66124.2
Process Creation                                126.0      67528.1   5359.4
Shell Scripts (1 concurrent)                     42.4     103709.4  24459.8
Shell Scripts (8 concurrent)                      6.0      13968.7  23281.2
System Call Overhead                          15000.0   14497214.3   9664.8
                                                                   ========
System Benchmarks Index Score                                       19236.3

After this patch:
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0 9014326929.3 772435.9
Double-Precision Whetstone                       55.0    1768022.0 321458.5
Execl Throughput                                 43.0      25340.4   5893.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     603479.0   1523.9
File Copy 256 bufsize 500 maxblocks            1655.0     150355.0    908.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    2157456.0   3719.8
Pipe Throughput                               12440.0  298863938.1 240244.3
Pipe-based Context Switching                   4000.0   31548980.3  78872.5
Process Creation                                126.0      64479.9   5117.5
Shell Scripts (1 concurrent)                     42.4     108471.0  25582.8
Shell Scripts (8 concurrent)                      6.0      14539.2  24232.0
System Call Overhead                          15000.0   12485789.2   8323.9
                                                                   ========
System Benchmarks Index Score                                       19862.6
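
The relative change between the two runs can be computed from the INDEX
columns. A small sketch, with the values copied from the tables above
(percent_change and print_changes are illustrative helper names):

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from before to after, in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

struct index_row {
	const char *name;
	double before;
	double after;
};

/* INDEX values taken from the "Before" and "After" tables. */
static const struct index_row rows[] = {
	{ "File Copy 1024 bufsize 2000 maxblocks", 1263.2, 1523.9 },
	{ "File Copy 256 bufsize 500 maxblocks",    820.5,  908.5 },
	{ "File Copy 4096 bufsize 8000 maxblocks", 3506.6, 3719.8 },
	{ "System Benchmarks Index Score",        19236.3, 19862.6 },
};

/* Print one line per row with the signed percentage change. */
static void print_changes(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-40s %+6.1f%%\n", rows[i].name,
		       percent_change(rows[i].before, rows[i].after));
}
```

With these inputs the three File Copy indexes rise by roughly 21%, 11%
and 6%, and the overall index score by about 3%.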

Note:
If the relative position of the per_cpu variables qos_throttled_cfs_rq
and osq_node changes, this workaround should be adjusted as well.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

parent aed150e2

kernel/sched/fair.c

+1 −1

@@ -145,7 +145,7 @@ int __weak arch_asym_cpu_priority(int cpu)

 #ifdef CONFIG_QOS_SCHED

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct list_head, qos_throttled_cfs_rq);
+static DEFINE_PER_CPU_SECTION(struct list_head, qos_throttled_cfs_rq, PER_CPU_SHARED_ALIGNED_SECTION) __attribute__((__aligned__(128)));
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct hrtimer, qos_overload_timer);
 static DEFINE_PER_CPU(int, qos_cpu_overload);
 unsigned int sysctl_overload_detect_period = 5000; /* in ms */