Commit c25cf09d authored by Zeng Heng

locking/osq_lock: Avoid false sharing in optimistic_spin_node

hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I8MV01



--------------------------------

Using the UnixBench test suite, the perf tool clearly shows that osq_lock()
causes extremely high overhead in the File Copy items:

Overhead  Shared Object            Symbol
  94.25%  [kernel]                 [k] osq_lock
   0.74%  [kernel]                 [k] rwsem_spin_on_owner
   0.32%  [kernel]                 [k] filemap_get_read_batch

In response, we analyzed the problem and found the following:

In the prologue of osq_lock(), the `cpu` member of the percpu struct
optimistic_spin_node is set to the local cpu id. After that, the value of
this member never actually changes. Based on that, we can regard the `cpu`
member as a constant.

Meanwhile, the other members of the percpu struct, such as next, prev and
locked, are frequently modified by osq_lock() and osq_unlock(), which are
called by rwsem, mutex and so on. Because they share a cache line with the
cpu member, those writes invalidate the cached copy of the cpu member on
other CPUs.

Therefore, we can add padding to split them into different cache lines,
which avoids cache misses when the next CPU is spinning and checks another
node's cpu member via vcpu_is_preempted().
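
The layout change described above can be illustrated with a minimal
user-space sketch. It is not the kernel diff, which this message does not
show: the struct names, the demo_line_index() helper and the 64-byte cache
line size are all assumptions made for illustration, and C11 `_Alignas`
stands in for whatever padding or alignment annotation the patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the commit does not state the size. */
#define DEMO_CACHE_LINE 64

/* Hypothetical stand-in for the old layout: cpu sits next to the hot fields. */
struct demo_node_packed {
	struct demo_node_packed *next, *prev;
	int locked;	/* written frequently while locking and unlocking */
	int cpu;	/* written once, then only read by other CPUs */
};

/* Hypothetical stand-in for the new layout: cpu starts its own cache line. */
struct demo_node_split {
	struct demo_node_split *next, *prev;
	int locked;
	_Alignas(DEMO_CACHE_LINE) int cpu;
};

/*
 * Cache line index of a member, relative to the start of the struct.
 * This equals the real line split only if the struct instance itself
 * starts on a cache line boundary.
 */
static inline size_t demo_line_index(size_t offset)
{
	return offset / DEMO_CACHE_LINE;
}

/* Old layout: cpu shares a line with locked, so false sharing is possible. */
static_assert(offsetof(struct demo_node_packed, cpu) / DEMO_CACHE_LINE ==
	      offsetof(struct demo_node_packed, locked) / DEMO_CACHE_LINE,
	      "packed layout keeps cpu and locked on one line");

/* New layout: writes to next/prev/locked no longer touch the line of cpu. */
static_assert(offsetof(struct demo_node_split, cpu) / DEMO_CACHE_LINE !=
	      offsetof(struct demo_node_split, locked) / DEMO_CACHE_LINE,
	      "split layout moves cpu to a separate line");
```

The trade-off in this sketch is size: the split struct grows to two cache
lines per node in exchange for keeping the read-mostly cpu field away from
the frequently written ones.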

The UnixBench full-core test results are as follows:
Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
Run the command "./Run -c 80 -i 3" 10 times and take the average.

System Benchmarks Index Values           Without Patch   With Patch     Diff
Dhrystone 2 using register variables         185876.43    185945.41    0.04%
Double-Precision Whetstone                    79637.27     79659.29    0.03%
Execl Throughput                               9909.61     10576.06    6.73%
File Copy 1024 bufsize 2000 maxblocks          1723.01      2086.08   21.07%
File Copy 256 bufsize 500 maxblocks            1150.24      1338.21   16.34%
File Copy 4096 bufsize 8000 maxblocks          3719.19      4011.99    7.87%
Pipe Throughput                               66184.84     66025.25   -0.24%
Pipe-based Context Switching                  30606.18     31074.21    1.53%
Process Creation                               9442.48      9450.77    0.09%
Shell Scripts (1 concurrent)                  44526.52     46548.54    4.54%
Shell Scripts (8 concurrent)                  42903.96     45718.56    6.56%
System Call Overhead                           3645.20      3717.42    1.98%
                                                                    ========
System Benchmarks Index Score                 15126.87     15931.29    5.32%

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
parent db10ed58