  1. Aug 05, 2023
    • locking/qspinlock: Move pv_ops into x86 directory · e1a08232
      Guo Ren authored
      
      
      The pv_ops structure belongs to x86's custom paravirt
      infrastructure, so move it into the x86 directory and clean up
      cna_configure_spin_lock_slowpath() with standard code. This is
      preparation for riscv support of CNA qspinlock.
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: Add trace point for pv_kick/wait · 747bc4d5
      Guo Ren authored
      
      
      Add trace points for pv_kick/pv_wait. Here is the output:
      
       entries-in-buffer/entries-written: 33927/33927   #P:12
      
                                      _-----=> irqs-off/BH-disabled
                                     / _----=> need-resched
                                    | / _---=> hardirq/softirq
                                    || / _--=> preempt-depth
                                    ||| / _-=> migrate-disable
                                    |||| /     delay
                 TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
                    | |         |   |||||     |         |
                   sh-100     [001] d..2.    28.312294: pv_wait: cpu 1 out of wfi
               <idle>-0       [000] d.h4.    28.322030: pv_kick: cpu 0 kick target cpu 1
                   sh-100     [001] d..2.    30.982631: pv_wait: cpu 1 out of wfi
               <idle>-0       [000] d.h4.    30.993289: pv_kick: cpu 0 kick target cpu 1
                   sh-100     [002] d..2.    44.987573: pv_wait: cpu 2 out of wfi
               <idle>-0       [000] d.h4.    44.989000: pv_kick: cpu 0 kick target cpu 2
               <idle>-0       [003] d.s3.    51.593978: pv_kick: cpu 3 kick target cpu 4
            rcu_sched-15      [004] d..2.    51.595192: pv_wait: cpu 4 out of wfi
      lock_torture_wr-115     [004] ...2.    52.656482: pv_kick: cpu 4 kick target cpu 2
      lock_torture_wr-113     [002] d..2.    52.659146: pv_wait: cpu 2 out of wfi
      lock_torture_wr-114     [008] d..2.    52.659507: pv_wait: cpu 8 out of wfi
      lock_torture_wr-114     [008] d..2.    52.663503: pv_wait: cpu 8 out of wfi
      lock_torture_wr-113     [002] ...2.    52.666128: pv_kick: cpu 2 kick target cpu 8
      lock_torture_wr-114     [008] d..2.    52.667261: pv_wait: cpu 8 out of wfi
      lock_torture_wr-114     [009] .n.2.    53.141515: pv_kick: cpu 9 kick target cpu 11
      lock_torture_wr-113     [002] d..2.    53.143339: pv_wait: cpu 2 out of wfi
      lock_torture_wr-116     [007] d..2.    53.143412: pv_wait: cpu 7 out of wfi
      lock_torture_wr-118     [000] d..2.    53.143457: pv_wait: cpu 0 out of wfi
      lock_torture_wr-115     [008] d..2.    53.143481: pv_wait: cpu 8 out of wfi
      lock_torture_wr-117     [011] d..2.    53.143522: pv_wait: cpu 11 out of wfi
      lock_torture_wr-117     [011] ...2.    53.143987: pv_kick: cpu 11 kick target cpu 8
      lock_torture_wr-115     [008] ...2.    53.144269: pv_kick: cpu 8 kick target cpu 7
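      
      A minimal sketch of such an event definition using the kernel's
      TRACE_EVENT machinery; the field layout here is an assumption
      inferred from the output above, not the literal patch contents:
      
         /* header boilerplate (#undef TRACE_SYSTEM etc.) omitted */
         TRACE_EVENT(pv_kick,
                 TP_PROTO(int cpu, int target),
                 TP_ARGS(cpu, target),
         
                 TP_STRUCT__entry(
                         __field(int, cpu)
                         __field(int, target)
                 ),
         
                 TP_fast_assign(
                         __entry->cpu = cpu;
                         __entry->target = target;
                 ),
         
                 /* renders as: "cpu 0 kick target cpu 1" */
                 TP_printk("cpu %d kick target cpu %d",
                           __entry->cpu, __entry->target)
         );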
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: Add kconfig entry · acc1f81b
      Guo Ren authored
      
      
      Add a Kconfig entry for paravirt_spinlock, an unfair but
      virtualization-friendly qspinlock backend that halts the virtual
      CPU rather than spinning.
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: Add SBI implementation · e9163f36
      Guo Ren authored
      
      
      Implement pv_kick via an SBI call, and add SBI_EXT_PVLOCK
      extension detection.
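      
      A hedged sketch of the kick side; the function ID name
      SBI_EXT_PVLOCK_KICK_CPU is an assumption, while sbi_ecall() and
      cpuid_to_hartid_map() are existing riscv helpers:
      
         static void pv_kick(int cpu)
         {
                 /* ask the SBI/hypervisor to wake the target hart */
                 sbi_ecall(SBI_EXT_PVLOCK, SBI_EXT_PVLOCK_KICK_CPU,
                           cpuid_to_hartid_map(cpu), 0, 0, 0, 0, 0);
         }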
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: Add nopvspin kernel parameter · 9bdf4d65
      Guo Ren authored
      
      
      The nopvspin parameter disables the qspinlock slow path that uses
      PV optimizations, which allow the hypervisor to 'idle' the guest
      on lock contention.
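      
      For reference, x86 wires up the same parameter roughly as in the
      sketch below; the riscv variant is presumably similar:
      
         /* nopvspin: skip the paravirt slow path even under a hypervisor */
         static bool nopvspin __initdata;
         
         static int __init parse_nopvspin(char *arg)
         {
                 nopvspin = true;
                 return 0;
         }
         early_param("nopvspin", parse_nopvspin);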
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: KVM: Implement kvm_sbi_ext_pvlock_kick_cpu() · 4c6c3342
      Guo Ren authored
      
      
      We only need to call kvm_vcpu_kick() to bring the target vCPU out
      of the halt state. No IRQ is raised and no other request is made;
      it is a pure vcpu_kick.
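      
      A sketch of the handler under those constraints; reading the
      target vCPU id from the guest's a0 register is an assumption
      about the calling convention:
      
         static int kvm_sbi_ext_pvlock_kick_cpu(struct kvm_vcpu *vcpu)
         {
                 struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
                 struct kvm_vcpu *target;
         
                 target = kvm_get_vcpu_by_id(vcpu->kvm, cp->a0);
                 if (!target)
                         return SBI_ERR_INVALID_PARAM;
         
                 /* no irq, no request: just wake the halted vcpu */
                 kvm_vcpu_kick(target);
         
                 return SBI_SUCCESS;
         }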
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: KVM: Add paravirt qspinlock skeleton · c328a9f5
      Guo Ren authored
      
      
      Add the files and functions needed to support the SBI PVLOCK
      (paravirt qspinlock kick_cpu) extension. This is preparation for
      the core implementation of kick_cpu in the next patch.
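      
      A hedged sketch of the skeleton, modeled on how other KVM SBI
      extensions are registered; the handler body and the fid name are
      assumptions:
      
         static int kvm_sbi_ext_pvlock_handler(struct kvm_vcpu *vcpu,
                                               struct kvm_run *run,
                                               struct kvm_vcpu_sbi_return *retdata)
         {
                 struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
         
                 switch (cp->a6) {
                 case SBI_EXT_PVLOCK_KICK_CPU:   /* hypothetical fid */
                         retdata->err_val = kvm_sbi_ext_pvlock_kick_cpu(vcpu);
                         break;
                 default:
                         retdata->err_val = SBI_ERR_NOT_SUPPORTED;
                 }
         
                 return 0;
         }
         
         const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pvlock = {
                 .extid_start = SBI_EXT_PVLOCK,
                 .extid_end   = SBI_EXT_PVLOCK,
                 .handler     = kvm_sbi_ext_pvlock_handler,
         };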
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • RISC-V: paravirt: pvqspinlock: Add paravirt qspinlock skeleton · fd5361d7
      Guo Ren authored
      
      
      Use static_call to switch between:
        native_queued_spin_lock_slowpath()    __pv_queued_spin_lock_slowpath()
        native_queued_spin_unlock()           __pv_queued_spin_unlock()
      
      This finishes the pv_wait implementation; pv_kick needs the SBI
      definitions from the next patches.
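      
      A minimal sketch of the switch; pv_qspinlock_init() is a
      hypothetical name for wherever the guest detection happens:
      
         DEFINE_STATIC_CALL(pv_queued_spin_lock_slowpath,
                            native_queued_spin_lock_slowpath);
         DEFINE_STATIC_CALL(pv_queued_spin_unlock,
                            native_queued_spin_unlock);
         
         void __init pv_qspinlock_init(void)
         {
                 /* running as a guest: switch to the paravirt variants */
                 static_call_update(pv_queued_spin_lock_slowpath,
                                    __pv_queued_spin_lock_slowpath);
                 static_call_update(pv_queued_spin_unlock,
                                    __pv_queued_spin_unlock);
         }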
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • riscv: qspinlock: Use new static key for controlling call of virt_spin_lock() · 226dde60
      Guo Ren authored
      
      
      Add a static key controlling whether virt_spin_lock() should be
      called. When running on bare metal, set the new key to false.
      
      KVM guests fall back to a test-and-set spinlock, because fair
      locks have horrible lock-holder preemption issues. The
      virt_spin_lock_key provides a shortcut in the
      queued_spin_lock_slowpath() function that allows
      virt_spin_lock() to hijack it.
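      
      For reference, the x86 shortcut looks roughly like the sketch
      below; the riscv variant presumably mirrors it:
      
         DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
         
         static inline bool virt_spin_lock(struct qspinlock *lock)
         {
                 if (!static_branch_likely(&virt_spin_lock_key))
                         return false;
         
                 /*
                  * Test-and-set: trades fairness for immunity to
                  * lock-holder preemption in a guest.
                  */
                 do {
                         while (atomic_read(&lock->val))
                                 cpu_relax();
                 } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
         
                 return true;
         }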
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • riscv: qspinlock: errata: Enable qspinlock for T-HEAD processors · b922c42a
      Guo Ren authored
      
      
      RISC-V gives only a weak LR/SC forward progress guarantee, which
      does not satisfy the requirements of qspinlock. But vendors may
      implement a stronger forward progress guarantee for LR/SC to
      ensure that xchg_tail finishes in time on any kind of hart.
      T-HEAD is a vendor that implements such strong forward guarantee
      LR/SC instruction pairs, so enable qspinlock for T-HEAD with the
      help of errata init.
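      
      A hedged sketch of the vendor gate; combo_qspinlock_key is a
      hypothetical name, while sbi_get_mvendorid() and THEAD_VENDOR_ID
      are existing riscv symbols:
      
         static void __init thead_errata_setup_spinlock(void)
         {
                 /*
                  * T-HEAD cores provide strong forward-progress LR/SC,
                  * so the queued lock is safe to select here.
                  */
                 if (sbi_get_mvendorid() == THEAD_VENDOR_ID)
                         static_branch_enable(&combo_qspinlock_key);
         }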
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • riscv: qspinlock: Allow force qspinlock from the command line · e5aa0b56
      Guo Ren authored
      
      
      Allow the command line to force the kernel to use queued_spinlock
      when CONFIG_RISCV_COMBO_SPINLOCKS=y.
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • riscv: qspinlock: Introduce combo spinlock · be425281
      Guo Ren authored
      
      
      The combo spinlock supports both queued and ticket locks in one
      Linux image, selecting between them at boot time via the errata
      mechanism. Here is the function size (bytes) comparison table:
      
      TYPE                     : COMBO | TICKET | QUEUED
      arch_spin_lock           : 106   | 60     | 50
      arch_spin_unlock         : 54    | 36     | 26
      arch_spin_trylock        : 110   | 72     | 54
      arch_spin_is_locked      : 48    | 34     | 20
      arch_spin_is_contended   : 56    | 40     | 24
      arch_spin_value_unlocked : 48    | 34     | 24
      
      One example, the disassembly of the combo arch_spin_unlock:
         0xffffffff8000409c <+14>:    nop                # detour slot
         0xffffffff800040a0 <+18>:    fence   rw,w       # queued spinlock start
         0xffffffff800040a4 <+22>:    sb      zero,0(a4) # queued spinlock end
         0xffffffff800040a8 <+26>:    ld      s0,8(sp)
         0xffffffff800040aa <+28>:    addi    sp,sp,16
         0xffffffff800040ac <+30>:    ret
         0xffffffff800040ae <+32>:    lw      a5,0(a4)   # ticket spinlock start
         0xffffffff800040b0 <+34>:    sext.w  a5,a5
         0xffffffff800040b2 <+36>:    fence   rw,w
         0xffffffff800040b6 <+40>:    addiw   a5,a5,1
         0xffffffff800040b8 <+42>:    slli    a5,a5,0x30
         0xffffffff800040ba <+44>:    srli    a5,a5,0x30
         0xffffffff800040bc <+46>:    sh      a5,0(a4)   # ticket spinlock end
         0xffffffff800040c0 <+50>:    ld      s0,8(sp)
         0xffffffff800040c2 <+52>:    addi    sp,sp,16
         0xffffffff800040c4 <+54>:    ret
      
      The qspinlock is smaller and faster than the ticket lock when
      both are on the fast path, and the combo spinlock provides a
      single compatible Linux image for processors with different
      micro-architectural designs (weak vs. strong forward-progress
      guarantee LR/SC).
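      
      A minimal sketch of the combo idea; the static key name is
      hypothetical:
      
         DEFINE_STATIC_KEY_TRUE(combo_qspinlock_key);
         
         static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
         {
                 /* patched once at boot, then a single branch per call */
                 if (static_branch_likely(&combo_qspinlock_key))
                         queued_spin_lock(lock);
                 else
                         ticket_spin_lock(lock);
         }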
      
      Signed-off-by: Guo Ren <guoren@kernel.org>
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
    • riscv: qspinlock: Add basic queued_spinlock support · 6e0a5882
      Guo Ren authored
      The requirements of qspinlock have been documented by commit:
      a8ad07e5 ("asm-generic: qspinlock: Indicate the use of mixed-size
      atomics").
      
      Although the RISC-V ISA gives only a weak LR/SC forward progress
      guarantee, which doesn't satisfy the qspinlock requirements
      above, that won't prevent some riscv vendors from implementing a
      strong forward progress guarantee for LR/SC in the
      microarchitecture to match the xchg_tail requirement. The T-HEAD
      C9xx processor is one such implementation.
      
      We've tested the patch on SOPHGO sg2042 & th1520 and passed
      stress tests on Fedora, Ubuntu, and OpenEuler ... Here is the
      performance comparison between qspinlock and ticket_lock on
      sg2042 (64 cores):
      
      sysbench test=threads threads=32 yields=100 lock=8 (+13.8%):
        queued_spinlock 0.5109/0.00
        ticket_spinlock 0.5814/0.00
      
      perf futex/hash (+6.7%):
        queued_spinlock 1444393 operations/sec (+- 0.09%)
        ticket_spinlock 1353215 operations/sec (+- 0.15%)
      
      perf futex/wake-parallel (+8.6%):
        queued_spinlock (waking 1/64 threads) in 0.0253 ms (+-2.90%)
        ticket_spinlock (waking 1/64 threads) in 0.0275 ms (+-3.12%)
      
      perf futex/requeue (+4.2%):
        queued_spinlock Requeued 64 of 64 threads in 0.0785 ms (+-0.55%)
        ticket_spinlock Requeued 64 of 64 threads in 0.0818 ms (+-4.12%)
      
      System Benchmarks (+6.4%)
        queued_spinlock:
          System Benchmarks Index Values               BASELINE       RESULT    INDEX
          Dhrystone 2 using register variables         116700.0  628613745.4  53865.8
          Double-Precision Whetstone                       55.0     182422.8  33167.8
          Execl Throughput                                 43.0      13116.6   3050.4
          File Copy 1024 bufsize 2000 maxblocks          3960.0    7762306.2  19601.8
          File Copy 256 bufsize 500 maxblocks            1655.0    3417556.8  20649.9
          File Copy 4096 bufsize 8000 maxblocks          5800.0    7427995.7  12806.9
          Pipe Throughput                               12440.0   23058600.5  18535.9
          Pipe-based Context Switching                   4000.0    2835617.7   7089.0
          Process Creation                                126.0      12537.3    995.0
          Shell Scripts (1 concurrent)                     42.4      57057.4  13456.9
          Shell Scripts (8 concurrent)                      6.0       7367.1  12278.5
          System Call Overhead                          15000.0   33308301.3  22205.5
                                                                             ========
          System Benchmarks Index Score                                       12426.1
      
        ticket_spinlock:
          System Benchmarks Index Values               BASELINE       RESULT    INDEX
          Dhrystone 2 using register variables         116700.0  626541701.9  53688.2
          Double-Precision Whetstone                       55.0     181921.0  33076.5
          Execl Throughput                                 43.0      12625.1   2936.1
          File Copy 1024 bufsize 2000 maxblocks          3960.0    6553792.9  16550.0
          File Copy 256 bufsize 500 maxblocks            1655.0    3189231.6  19270.3
          File Copy 4096 bufsize 8000 maxblocks          5800.0    7221277.0  12450.5
          Pipe Throughput                               12440.0   20594018.7  16554.7
          Pipe-based Context Switching                   4000.0    2571117.7   6427.8
          Process Creation                                126.0      10798.4    857.0
          Shell Scripts (1 concurrent)                     42.4      57227.5  13497.1
          Shell Scripts (8 concurrent)                      6.0       7329.2  12215.3
          System Call Overhead                          15000.0   30766778.4  20511.2
                                                                             ========
          System Benchmarks Index Score                                       11670.7
      
      The qspinlock shows a significant improvement over the
      ticket_lock on the 64-core SOPHGO SG2042 platform.
      
      Signed-off-by: Guo Ren <guoren@kernel.org>
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
    • riscv: qspinlock: errata: Add ERRATA_THEAD_WRITE_ONCE fixup · 1af84b48
      Guo Ren authored
      
      
      Early versions of the T-Head C9xx cores have a store merge buffer
      delay problem. The store merge buffer improves store queue
      performance by merging multiple store requests, but when no
      further store requests follow, the prior single store request can
      wait in the store queue for a long time. That causes significant
      problems for communication between cores. This problem was found
      on the sg2042 & th1520 platforms with the qspinlock lock torture
      test.
      
      Appending a "fence w, o" immediately flushes the store merge
      buffer and lets the other cores see the write result.
      
      Apply the WRITE_ONCE errata to handle this non-standard behavior
      by appending a "fence w, o" instruction to WRITE_ONCE().
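      
      A hedged sketch of the fixup via the riscv ALTERNATIVE mechanism,
      modeled on other T-Head errata (the exact macro spelling is an
      assumption):
      
         #define __WRITE_ONCE(x, val)                                    \
         do {                                                            \
                 *(volatile typeof(x) *)&(x) = (val);                    \
                 /* flush the store merge buffer on affected cores */    \
                 asm volatile(ALTERNATIVE("nop", "fence w, o",           \
                                          THEAD_VENDOR_ID,               \
                                          ERRATA_THEAD_WRITE_ONCE,       \
                                          CONFIG_ERRATA_THEAD_WRITE_ONCE)\
                              : : : "memory");                           \
         } while (0)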
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • asm-generic: ticket-lock: Move into ticket_spinlock.h · 28794cd9
      Guo Ren authored
      
      
      Move the ticket-lock definitions into an independent file. This
      is preparation for the upcoming riscv combo spinlock.
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • asm-generic: ticket-lock: Reuse arch_spinlock_t of qspinlock · da4c87a0
      Guo Ren authored
      
      
      The qspinlock's arch_spinlock_t already contains an atomic_t val,
      which satisfies the ticket-lock requirement. Thus, unify
      arch_spinlock_t into qspinlock_types.h. This is preparation for
      the upcoming combo spinlock.
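      
      The unified type, sketched from the little-endian layout already
      in qspinlock_types.h; the ticket lock then treats val as two
      16-bit halves:
      
         typedef struct qspinlock {
                 union {
                         atomic_t val;
                         struct {                /* little-endian */
                                 u8 locked;
                                 u8 pending;
                         };
                         struct {
                                 u16 locked_pending;
                                 u16 tail;
                         };
                 };
         } arch_spinlock_t;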
      
      Signed-off-by: Guo Ren <guoren@kernel.org>
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
    • asm-generic: ticket-lock: Optimize arch_spin_value_unlocked · 6e538e74
      Guo Ren authored
      
      
      arch_spin_value_unlocked() performs an unnecessary memory access
      to the contended value. Although this won't cause a significant
      performance gap on most architectures, the
      arch_spin_value_unlocked() argument already contains enough
      information. Thus, remove the unnecessary atomic_read() in
      arch_spin_value_unlocked().
      
      Callers of arch_spin_value_unlocked() benefit from this change;
      currently, the only caller is lockref.
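      
      The change, sketched; the "after" form relies on the ticket-lock
      encoding where the two 16-bit halves are equal exactly when the
      lock is free:
      
         /* before: takes an address and re-reads the lock word */
         static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
         {
                 return !arch_spin_is_locked(&lock);
         }
         
         /* after: decode the copy that was already passed by value */
         static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
         {
                 u32 val = lock.counter;
         
                 return ((val >> 16) == (val & 0xffff));
         }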
      
      Signed-off-by: Guo Ren <guoren@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
    • riscv: cmpxchg: Add xchg_small & cmpxchg_small support · ce91fa27
      Guo Ren authored
      
      
      The pvqspinlock needs additional sub-word atomic operations. Here
      is the list:
       - xchg8  (RCsc)
       - xchg16 (Relaxed)
       - cmpxchg8/16_relaxed
       - cmpxchg8/16_release (RCpc)
       - cmpxchg8_acquire (RCpc)
       - cmpxchg8 (RCsc)
      
      Although the paravirt qspinlock doesn't have native_qspinlock's
      fairness, giving a strong forward progress guarantee to these
      atomic operations can prevent unnecessary retries, which would
      otherwise cause cache line bouncing.
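      
      A hedged C-level sketch of the sub-word technique: operate on the
      aligned 32-bit word containing the halfword and retry on conflict
      (the actual patch presumably uses LR/SC inline assembly):
      
         static inline u16 xchg16_relaxed(volatile u16 *ptr, u16 new)
         {
                 u32 *p = (u32 *)((unsigned long)ptr & ~0x3UL);
                 int shift = ((unsigned long)ptr & 0x3) * BITS_PER_BYTE;
                 u32 mask = 0xffffU << shift;
                 u32 old = READ_ONCE(*p), tmp;
         
                 do {
                         /* splice the new halfword into the old word */
                         tmp = (old & ~mask) | ((u32)new << shift);
                 } while (!try_cmpxchg_relaxed(p, &old, tmp));
         
                 return (old & mask) >> shift;
         }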
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
    • riscv: cmpxchg: Remove unnecessary definitions of cmpxchg & xchg · f3d049e4
      Guo Ren authored
      
      
      The custom xchg/cmpxchg_release macro definitions are identical
      to the common code at the binary level. The xchg32/64 macro
      definitions have been abandoned in Linux. Thus, remove all of
      them.
      
      This is preparation for the following cmpxchg_small & xchg8
      patches.
      
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
  2. Jul 31, 2023
    • riscv: kexec: Fixup synchronization problem between init_mm and active_mm · 9d27c14a
      Guo Ren authored
      machine_kexec() uses set_memory_x to modify the direct mapping
      attributes from RW to RWX. The current implementation of
      set_memory_x does not split hugepages in the linear mapping, so
      when a PGD mapping is used, the whole PGD is marked executable.
      But changing the permissions at the PGD level must be propagated
      to all the page tables. When kexec jumps into the control buffer,
      an instruction page fault happens, there is no minor page fault
      handler for it, and the kernel panics.
      
      The bug was found on an MMU_sv39 machine whose direct mapping
      uses 1GB PUD entries in the pgd. Here is the bug output:
      
       kexec_core: Starting new kernel
       Will call new kernel at 00300000 from hart id 0
       FDT image at 747c7000
       Bye...
       Unable to handle kernel paging request at virtual address ffffffda23b0d000
       Oops [#1]
       Modules linked in:
       CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15
       Hardware name: Sophgo Mango (DT)
       epc : 0xffffffda23b0d000
        ra : machine_kexec+0xa6/0xb0
       epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10
        gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000
        t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50
        s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000
        a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000
        a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff
        s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000
        s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000
        s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
        s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af
        t5 : ffffffff815351b0 t6 : ffffffc80c173b50
       status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c
      
      Given the current flaw in the set_memory_x implementation, the
      simplest solution is to fix machine_kexec() to remap the control
      code page outside the linear mapping. Because the control code
      buffer is moved from the direct mapping area to a vmalloc
      location, we need an additional va_va_offset to fix up
      va_pa_offset.
      
      Fixes: 3335068f ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
      Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Reported-by: Xing XiaoGuang <xingxg2008@163.com>
      Signed-off-by: Guo Ren <guoren@kernel.org>
      Tested-by: Xing Xiaoguang <xingxg2008@163.com>
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>