- Aug 05, 2023
-
-
Guo Ren authored
Connect riscv to the Compact NUMA-aware lock (CNA), which uses the PARAVIRT_SPINLOCKS static_call hooks. See the numa_spinlock= entry in Documentation/admin-guide/kernel-parameters.txt to try it. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
The pv_ops mechanism belongs to x86's custom infrastructure, so clean up cna_configure_spin_lock_slowpath() with standard code. This is preparation for riscv support of the CNA qspinlock. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
Add trace points for pv_kick/pv_wait; here is the output:

 # entries-in-buffer/entries-written: 33927/33927   #P:12
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
              sh-100     [001] d..2.    28.312294: pv_wait: cpu 1 out of wfi
          <idle>-0       [000] d.h4.    28.322030: pv_kick: cpu 0 kick target cpu 1
              sh-100     [001] d..2.    30.982631: pv_wait: cpu 1 out of wfi
          <idle>-0       [000] d.h4.    30.993289: pv_kick: cpu 0 kick target cpu 1
              sh-100     [002] d..2.    44.987573: pv_wait: cpu 2 out of wfi
          <idle>-0       [000] d.h4.    44.989000: pv_kick: cpu 0 kick target cpu 2
          <idle>-0       [003] d.s3.    51.593978: pv_kick: cpu 3 kick target cpu 4
       rcu_sched-15      [004] d..2.    51.595192: pv_wait: cpu 4 out of wfi
 lock_torture_wr-115     [004] ...2.    52.656482: pv_kick: cpu 4 kick target cpu 2
 lock_torture_wr-113     [002] d..2.    52.659146: pv_wait: cpu 2 out of wfi
 lock_torture_wr-114     [008] d..2.    52.659507: pv_wait: cpu 8 out of wfi
 lock_torture_wr-114     [008] d..2.    52.663503: pv_wait: cpu 8 out of wfi
 lock_torture_wr-113     [002] ...2.    52.666128: pv_kick: cpu 2 kick target cpu 8
 lock_torture_wr-114     [008] d..2.    52.667261: pv_wait: cpu 8 out of wfi
 lock_torture_wr-114     [009] .n.2.    53.141515: pv_kick: cpu 9 kick target cpu 11
 lock_torture_wr-113     [002] d..2.    53.143339: pv_wait: cpu 2 out of wfi
 lock_torture_wr-116     [007] d..2.    53.143412: pv_wait: cpu 7 out of wfi
 lock_torture_wr-118     [000] d..2.    53.143457: pv_wait: cpu 0 out of wfi
 lock_torture_wr-115     [008] d..2.    53.143481: pv_wait: cpu 8 out of wfi
 lock_torture_wr-117     [011] d..2.    53.143522: pv_wait: cpu 11 out of wfi
 lock_torture_wr-117     [011] ...2.    53.143987: pv_kick: cpu 11 kick target cpu 8
 lock_torture_wr-115     [008] ...2.    53.144269: pv_kick: cpu 8 kick target cpu 7

Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
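A minimal sketch of how such a tracepoint is typically declared in a trace header; the field names and header plumbing are assumptions, not the patch's exact code:

  /* trace-header-style sketch; needs the usual CREATE_TRACE_POINTS
   * boilerplate in exactly one .c file. */
  #include <linux/tracepoint.h>

  TRACE_EVENT(pv_kick,
          TP_PROTO(int cpu, int target),
          TP_ARGS(cpu, target),
          TP_STRUCT__entry(
                  __field(int, cpu)
                  __field(int, target)
          ),
          TP_fast_assign(
                  __entry->cpu = cpu;
                  __entry->target = target;
          ),
          TP_printk("cpu %d kick target cpu %d",
                    __entry->cpu, __entry->target)
  );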
-
Guo Ren authored
Add a kconfig entry for paravirt_spinlock, an unfair, virtualization-friendly qspinlock backend that halts the virtual CPU rather than spinning. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
Implement pv_kick() via SBI, and add detection of the SBI_EXT_PVLOCK extension. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
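A minimal sketch of what pv_kick() via SBI could look like. SBI_EXT_PVLOCK and the kick-cpu function ID come from this series rather than the ratified SBI spec, so both names are assumptions:

  #include <asm/sbi.h>
  #include <asm/smp.h>

  static void pv_kick(int cpu)
  {
          /* One ecall per kick: ask the SBI/hypervisor to wake the hart. */
          sbi_ecall(SBI_EXT_PVLOCK, SBI_EXT_PVLOCK_KICK_CPU,
                    cpuid_to_hartid_map(cpu), 0, 0, 0, 0, 0);
  }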
-
Guo Ren authored
Disable the qspinlock slow path in favor of PV optimizations, which allow the hypervisor to 'idle' the guest on lock contention. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
We only need to call kvm_vcpu_kick() to bring the target_vcpu out of the halt state. No irq is raised and no other request is made; it is a pure vcpu kick. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
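A hedged sketch of the host-side handler; the vCPU-lookup helper and the return plumbing are illustrative, and only kvm_vcpu_kick() is the point:

  static int pvlock_kick_cpu(struct kvm_vcpu *vcpu, unsigned long hartid)
  {
          /* assumed helper: map the guest hart id to its vCPU */
          struct kvm_vcpu *target = riscv_hartid_to_vcpu(vcpu->kvm, hartid);

          if (!target)
                  return SBI_ERR_INVALID_PARAM;

          kvm_vcpu_kick(target);  /* pure kick: no irq, no extra request */
          return SBI_SUCCESS;
  }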
-
Guo Ren authored
Add the files and functions needed to support the SBI PVLOCK (paravirt qspinlock kick_cpu) extension. This is preparation for the core kick_cpu implementation that follows. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
Use static_call to switch between:

  native_queued_spin_lock_slowpath()  <->  __pv_queued_spin_lock_slowpath()
  native_queued_spin_unlock()         <->  __pv_queued_spin_unlock()

This finishes the pv_wait implementation; pv_kick needs the SBI definitions from the next patches. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
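A minimal sketch of the static_call wiring, assuming the standard qspinlock slow-path signatures; the init hook name is illustrative:

  #include <linux/static_call.h>
  #include <linux/types.h>

  struct qspinlock;
  void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
  void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);

  DEFINE_STATIC_CALL(pv_queued_spin_lock_slowpath,
                     native_queued_spin_lock_slowpath);

  /* the lock's slow path then becomes:
   *   static_call(pv_queued_spin_lock_slowpath)(lock, val);
   */

  static void __init pv_qspinlock_init(void)
  {
          /* retarget the slow path only when running as a guest */
          static_call_update(pv_queued_spin_lock_slowpath,
                             __pv_queued_spin_lock_slowpath);
  }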
-
Guo Ren authored
Add a static key controlling whether virt_spin_lock() should be called. When running on bare metal, set the new key to false. KVM guests fall back to a test-and-set spinlock, because fair locks have horrible lock-holder preemption issues. The virt_spin_lock_key provides a shortcut in queued_spin_lock_slowpath(), allowing virt_spin_lock() to hijack it. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
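For reference, a sketch modeled on the x86 virt_spin_lock(), assuming the generic qspinlock definitions (lock->val, _Q_LOCKED_VAL) are in scope; the riscv version in this patch may differ in detail:

  DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);

  static inline bool virt_spin_lock(struct qspinlock *lock)
  {
          if (!static_branch_likely(&virt_spin_lock_key))
                  return false;   /* bare metal: take the queued slow path */

          /* Unfair test-and-set loop, tolerant of vCPU preemption. */
          do {
                  while (atomic_read(&lock->val) != 0)
                          cpu_relax();
          } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

          return true;
  }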
-
Guo Ren authored
The RISC-V ISA only mandates a weak LR/SC forward-progress guarantee, which does not satisfy qspinlock's requirements. But vendors may implement LR/SC with a stronger forward-progress guarantee, ensuring xchg_tail finishes in time on any kind of hart. T-HEAD is one vendor that implements such strong forward-progress LR/SC instruction pairs, so enable qspinlock for T-HEAD with the help of errata init. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
Allow cmdline to force the kernel to use queued_spinlock when CONFIG_RISCV_COMBO_SPINLOCKS=y. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
The combo spinlock supports both queued and ticket locks in one Linux image, selecting between them at boot time via the errata mechanism. Here is the function size (bytes) comparison:

  TYPE                     : COMBO | TICKET | QUEUED
  arch_spin_lock           :   106 |     60 |     50
  arch_spin_unlock         :    54 |     36 |     26
  arch_spin_trylock        :   110 |     72 |     54
  arch_spin_is_locked      :    48 |     34 |     20
  arch_spin_is_contended   :    56 |     40 |     24
  arch_spin_value_unlocked :    48 |     34 |     24

One example, the disassembly of the combo arch_spin_unlock:

  0xffffffff8000409c <+14>:  nop                 # detour slot
  0xffffffff800040a0 <+18>:  fence   rw,w        # queued spinlock start
  0xffffffff800040a4 <+22>:  sb      zero,0(a4)  # queued spinlock end
  0xffffffff800040a8 <+26>:  ld      s0,8(sp)
  0xffffffff800040aa <+28>:  addi    sp,sp,16
  0xffffffff800040ac <+30>:  ret
  0xffffffff800040ae <+32>:  lw      a5,0(a4)    # ticket spinlock start
  0xffffffff800040b0 <+34>:  sext.w  a5,a5
  0xffffffff800040b2 <+36>:  fence   rw,w
  0xffffffff800040b6 <+40>:  addiw   a5,a5,1
  0xffffffff800040b8 <+42>:  slli    a5,a5,0x30
  0xffffffff800040ba <+44>:  srli    a5,a5,0x30
  0xffffffff800040bc <+46>:  sh      a5,0(a4)    # ticket spinlock end
  0xffffffff800040c0 <+50>:  ld      s0,8(sp)
  0xffffffff800040c2 <+52>:  addi    sp,sp,16
  0xffffffff800040c4 <+54>:  ret

The qspinlock is smaller and faster than the ticket lock when everything stays on the fast path, and the combo spinlock provides a single compatible Linux image for processors with different micro-architecture designs (weak vs. strong forward-progress guarantee LR/SC). Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
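A sketch of the combo dispatch: one static branch, flipped during errata probing at boot, selects the backend. The key name is an assumption:

  DECLARE_STATIC_KEY_TRUE(qspinlock_key);

  static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
  {
          if (static_branch_likely(&qspinlock_key))
                  queued_spin_lock(lock);
          else
                  ticket_spin_lock(lock);
  }

The nop in the disassembly above is this static-branch detour slot: when the key is disabled it is patched into a jump over the queued body to the ticket body.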
-
Guo Ren authored
The requirements of qspinlock have been documented by commit a8ad07e5 ("asm-generic: qspinlock: Indicate the use of mixed-size atomics"). Although the RISC-V ISA only mandates a weaker LR/SC forward-progress guarantee, which doesn't satisfy the requirements of qspinlock above, that doesn't prevent some riscv vendors from implementing a strong forward-progress guarantee for LR/SC in the micro-architecture to match the xchg_tail requirement. The T-HEAD C9xx processor is one of them. We've tested the patch on SOPHGO sg2042 & th1520 and passed stress tests on Fedora, Ubuntu, OpenEuler, ...

Here is the performance comparison between qspinlock and ticket_lock on sg2042 (64 cores):

sysbench test=threads threads=32 yields=100 lock=8 (+13.8%):
  queued_spinlock 0.5109/0.00
  ticket_spinlock 0.5814/0.00

perf futex/hash (+6.7%):
  queued_spinlock 1444393 operations/sec (+- 0.09%)
  ticket_spinlock 1353215 operations/sec (+- 0.15%)

perf futex/wake-parallel (+8.6%):
  queued_spinlock (waking 1/64 threads) in 0.0253 ms (+-2.90%)
  ticket_spinlock (waking 1/64 threads) in 0.0275 ms (+-3.12%)

perf futex/requeue (+4.2%):
  queued_spinlock Requeued 64 of 64 threads in 0.0785 ms (+-0.55%)
  ticket_spinlock Requeued 64 of 64 threads in 0.0818 ms (+-4.12%)

System Benchmarks (+6.4%):

queued_spinlock: System Benchmarks Index Values
                                               BASELINE       RESULT    INDEX
  Dhrystone 2 using register variables         116700.0  628613745.4  53865.8
  Double-Precision Whetstone                       55.0     182422.8  33167.8
  Execl Throughput                                 43.0      13116.6   3050.4
  File Copy 1024 bufsize 2000 maxblocks          3960.0    7762306.2  19601.8
  File Copy 256 bufsize 500 maxblocks            1655.0    3417556.8  20649.9
  File Copy 4096 bufsize 8000 maxblocks          5800.0    7427995.7  12806.9
  Pipe Throughput                               12440.0   23058600.5  18535.9
  Pipe-based Context Switching                   4000.0    2835617.7   7089.0
  Process Creation                                126.0      12537.3    995.0
  Shell Scripts (1 concurrent)                     42.4      57057.4  13456.9
  Shell Scripts (8 concurrent)                      6.0       7367.1  12278.5
  System Call Overhead                          15000.0   33308301.3  22205.5
                                                                     ========
  System Benchmarks Index Score                                       12426.1

ticket_spinlock: System Benchmarks Index Values
                                               BASELINE       RESULT    INDEX
  Dhrystone 2 using register variables         116700.0  626541701.9  53688.2
  Double-Precision Whetstone                       55.0     181921.0  33076.5
  Execl Throughput                                 43.0      12625.1   2936.1
  File Copy 1024 bufsize 2000 maxblocks          3960.0    6553792.9  16550.0
  File Copy 256 bufsize 500 maxblocks            1655.0    3189231.6  19270.3
  File Copy 4096 bufsize 8000 maxblocks          5800.0    7221277.0  12450.5
  Pipe Throughput                               12440.0   20594018.7  16554.7
  Pipe-based Context Switching                   4000.0    2571117.7   6427.8
  Process Creation                                126.0      10798.4    857.0
  Shell Scripts (1 concurrent)                     42.4      57227.5  13497.1
  Shell Scripts (8 concurrent)                      6.0       7329.2  12215.3
  System Call Overhead                          15000.0   30766778.4  20511.2
                                                                     ========
  System Benchmarks Index Score                                       11670.7

The qspinlock shows a significant improvement over the ticket_lock on the 64-core SOPHGO SG2042 platform. Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
-
Guo Ren authored
Early versions of the T-Head C9xx cores have a store-merge-buffer delay problem. The store merge buffer improves store-queue performance by merging multiple store requests, but when no further stores follow, a prior single store request can sit in the store queue for a long time. That causes significant problems for communication between cores. The problem was found on the sg2042 & th1520 platforms with the qspinlock lock-torture test. Appending a fence w,o immediately flushes the store merge buffer and lets other cores see the write result. Apply a WRITE_ONCE errata that handles this non-standard behavior by appending a fence w,o instruction to WRITE_ONCE(). Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
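A sketch of how such an errata is typically wired up with the riscv ALTERNATIVE mechanism; the errata and config identifiers are assumptions taken from this series, not upstream names:

  /* patched into a real fence only on affected T-Head cores */
  #define __write_once_flush()                                      \
          asm volatile(ALTERNATIVE("nop", "fence w, o",             \
                                   THEAD_VENDOR_ID,                 \
                                   ERRATA_THEAD_WRITE_ONCE,         \
                                   CONFIG_ERRATA_THEAD_WRITE_ONCE)  \
                       : : : "memory")

  #define WRITE_ONCE(x, val)                                        \
  do {                                                              \
          __WRITE_ONCE(x, val);                                     \
          __write_once_flush();  /* drain the store merge buffer */ \
  } while (0)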
-
Guo Ren authored
Move the ticket-lock definitions into an independent file. This is preparation for riscv's combo spinlock in the next patches. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
Guo Ren authored
The qspinlock's arch_spinlock_t already contains an atomic_t val, which satisfies the ticket-lock requirement. Thus, unify arch_spinlock_t into qspinlock_types.h. This is preparation for the combo spinlock that follows. Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
-
Guo Ren authored
arch_spin_value_unlocked() would cause an unnecessary memory access to the contended value. Although this won't cause a significant performance gap on most architectures, the arch_spin_value_unlocked() argument already contains enough information. Thus, remove the unnecessary atomic_read() in arch_spin_value_unlocked(). Callers of arch_spin_value_unlocked() benefit from this change; currently, the only caller is lockref. Signed-off-by: Guo Ren <guoren@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
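The resulting ticket-lock variant boils down to a halfword compare on the value already held in a register, sketched here under the assumption that the generic arch_spinlock_t is an atomic_t:

  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
  {
          u32 val = lock.counter;

          /* unlocked iff the ticket "next" half equals the "owner" half */
          return (val >> 16) == (val & 0xffff);
  }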
-
Guo Ren authored
The pvqspinlock needs additional sub-word atomic operations. Here is the list:

 - xchg8 (RCsc)
 - xchg16 (relaxed)
 - cmpxchg8/16_relaxed
 - cmpxchg8/16_release (RCpc)
 - cmpxchg8_acquire (RCpc)
 - cmpxchg8 (RCsc)

Although the paravirt qspinlock lacks native_qspinlock's fairness, giving these atomic operations a strong forward-progress guarantee prevents unnecessary retries, which would otherwise cause cache line bouncing. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
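One common way to build the 16-bit xchg from word-sized LR/SC is to operate on the aligned 32-bit word and mask in the halfword (little-endian assumed). A hedged sketch of the relaxed form only, not necessarily the patch's exact code; the acquire/release/RCsc variants add fences around the loop:

  static inline u16 __xchg16_relaxed(volatile u16 *ptr, u16 newval)
  {
          volatile u32 *p = (volatile u32 *)((unsigned long)ptr & ~0x3UL);
          unsigned int shift = ((unsigned long)ptr & 0x2) * 8;
          u32 mask = 0xffffU << shift;
          u32 insert = (u32)newval << shift;
          u32 old, tmp;

          asm volatile(
          "0:     lr.w    %0, %2\n"
          "       and     %1, %0, %3\n"   /* clear the target halfword */
          "       or      %1, %1, %4\n"   /* insert the new halfword   */
          "       sc.w    %1, %1, %2\n"
          "       bnez    %1, 0b\n"
          : "=&r" (old), "=&r" (tmp), "+A" (*p)
          : "r" (~mask), "r" (insert)
          : "memory");

          return (u16)((old & mask) >> shift);
  }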
-
Guo Ren authored
From a binary point of view, the custom xchg/cmpxchg_release macro definitions are no different from the common code. The xchg32/64 macro definitions have been abandoned in Linux. Thus, remove all of them. This is preparation for the cmpxchg_small & xchg8 patches that follow. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
-
- Aug 02, 2023
-
-
Andrew Jones authored
Now that we can support steal-time accounting, add the kconfig knobs allowing it to be enabled. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
-
Andrew Jones authored
When the SBI STA extension exists we can use it to implement paravirt steal-time support. Fill in the empty pv-time functions with an SBI STA implementation. Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
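The core of the steal-clock read is a sequence-count loop over the shared-memory record (layout sketched under the STA definitions commit below). This closely follows the shape of the series, assuming a per-cpu steal_time area registered via the SET_SHMEM call:

  static DEFINE_PER_CPU(struct sbi_sta_struct, steal_time) __aligned(64);

  static u64 pv_time_steal_clock(int cpu)
  {
          struct sbi_sta_struct *st = per_cpu_ptr(&steal_time, cpu);
          __le32 sequence;
          __le64 steal;

          /* An odd sequence means the SBI is mid-update: retry. */
          do {
                  sequence = READ_ONCE(st->sequence);
                  virt_rmb();
                  steal = READ_ONCE(st->steal);
                  virt_rmb();
          } while ((le32_to_cpu(sequence) & 1) ||
                   sequence != READ_ONCE(st->sequence));

          return le64_to_cpu(steal);
  }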
-
Andrew Jones authored
The SBI STA extension enables steal-time accounting. Add the definitions it specifies. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
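A sketch of those definitions as the draft STA extension lays them out ("STA" in ASCII gives the extension ID); the field layout follows the spec revision this series targets:

  #define SBI_EXT_STA                     0x535441

  enum sbi_ext_sta_fid {
          SBI_EXT_STA_STEAL_TIME_SET_SHMEM = 0,
  };

  struct sbi_sta_struct {
          __le32  sequence;       /* odd while the SBI updates the record */
          __le32  flags;
          __le64  steal;          /* stolen time, in nanoseconds */
          u8      preempted;
          u8      pad[47];        /* pad the record to 64 bytes */
  } __packed;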
-
Andrew Jones authored
Add the files and functions needed to support paravirt time on RISC-V. Also include the common code needed for the first application of pv-time, which is steal-time. In the next patches we'll complete the functions to fully enable steal-time support. Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
-
Alex Kogan authored
This performance optimization chooses probabilistically to avoid moving threads from the main queue into the secondary one when the secondary queue is empty. It is helpful when the lock is only lightly contended. In particular, it makes CNA less eager to create a secondary queue, but does not introduce any extra delays for threads waiting in that queue once it is created. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
-
Alex Kogan authored
Prohibit moving certain threads (e.g., in irq and nmi contexts) to the secondary queue. Those prioritized threads will always stay in the primary queue, and so will have a shorter wait time for the lock. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
-
Alex Kogan authored
Keep track of the time the thread at the head of the secondary queue has been waiting, and force inter-node handoff once this time passes a preset threshold. The default value for the threshold (1ms) can be overridden with the new kernel boot command-line option "qspinlock.numa_spinlock_threshold_ns". Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
-
Alex Kogan authored
In CNA, spinning threads are organized in two queues: a primary queue for threads running on the same node as the current lock holder, and a secondary queue for threads running on other nodes. After acquiring the MCS lock and before acquiring the spinlock, the MCS lock holder checks whether the next waiter in the primary queue (if one exists) is running on the same NUMA node. If it is not, that waiter is detached from the main queue and moved to the tail of the secondary queue. This way, we gradually filter the primary queue, leaving only waiters running on the same preferred NUMA node. For more details, see https://arxiv.org/abs/1810.05600 . Note that this variant of CNA may introduce starvation by continuously passing the lock between waiters in the main queue. This issue is addressed later in the series. Enabling CNA is controlled via a new configuration option (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at boot time only if we run on a multi-node machine in a native environment and the new config is enabled. (For the time being, the patching requires CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be resolved once static_call() is available.) This default behavior can be overridden with the new kernel boot command-line option "numa_spinlock=on/off" (default is "auto"). Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
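A much-simplified sketch of the filtering step described above; struct and helper names are illustrative (assuming cna_node embeds mcs_spinlock as its first member), and the real code scans rather than checking only the immediate successor:

  static void cna_order_queue(struct mcs_spinlock *node)
  {
          struct cna_node *cn = (struct cna_node *)node;
          struct cna_node *next =
                  (struct cna_node *)READ_ONCE(node->next);

          if (!next)
                  return;

          if (next->numa_node != cn->numa_node) {
                  /* detach 'next' from the primary queue ... */
                  WRITE_ONCE(node->next, READ_ONCE(next->mcs.next));
                  /* ... and append it to the secondary queue's tail */
                  cna_splice_tail(cn, next);      /* assumed helper */
          }
  }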
-
Alex Kogan authored
Move some of the code manipulating the spin lock into separate functions. This would allow easier integration of alternative ways to manipulate that lock. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
-
Alex Kogan authored
The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be stored into the lock argument as another argument. This allows using the same macro in cases where the value to be stored when passing the lock is different from 1. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com>
-
- Jul 31, 2023
-
-
Guo Ren authored
machine_kexec() uses set_memory_x() to change the direct-mapping attributes from RW to RWX. The current implementation of set_memory_x() does not split hugepages in the linear mapping, so when a PGD mapping is used, the whole PGD is marked executable; changing the permissions at the PGD level must be propagated to all the page tables. When kexec jumps into the control_buffer, an instruction page fault happens, there is no minor_pagefault handler for it, and the kernel panics. The bug was found on an MMU_sv39 machine whose direct mapping used 1GB PUD/pgd entries. Here is the bug output:

 kexec_core: Starting new kernel
 Will call new kernel at 00300000 from hart id 0
 FDT image at 747c7000
 Bye...
 Unable to handle kernel paging request at virtual address ffffffda23b0d000
 Oops [#1]
 Modules linked in:
 CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15
 Hardware name: Sophgo Mango (DT)
 epc : 0xffffffda23b0d000
  ra : machine_kexec+0xa6/0xb0
 epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10
  gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000
  t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50
  s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000
  a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000
  a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff
  s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000
  s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000
  s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
  s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af
  t5 : ffffffff815351b0 t6 : ffffffc80c173b50
 status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c

Given the current flaw in the set_memory_x() implementation, the simplest solution is to fix machine_kexec() to remap the control code page outside the linear mapping. Because the control code buffer moves from the direct-mapping area to a vmalloc location, an additional va_va_offset is needed to fix up va_pa_offset. Fixes: 3335068f ("riscv: Use PUD/P4D/PGD pages for the linear mapping") Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reported-by: Xing XiaoGuang <xingxg2008@163.com> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Xing Xiaoguang <xingxg2008@163.com> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
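The direction of the fix, sketched: give the control code page its own executable alias outside the linear map instead of flipping direct-map permissions (the helper name is illustrative):

  #include <linux/vmalloc.h>

  static void *kexec_remap_control_page(struct page *page)
  {
          /* fresh VA with PAGE_KERNEL_EXEC; the linear mapping stays RW */
          return vmap(&page, 1, VM_MAP, PAGE_KERNEL_EXEC);
  }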
-
- Jul 27, 2023
-
- Jul 21, 2023
-
-
Xiaoguang Xing authored
Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-
Xiaoguang Xing authored
Fix a qspinlock issue that loops calling cpu_relax and never exits. The call trace is: queued_spin_lock_slowpath -> arch_mcs_spin_lock_contended -> smp_cond_load_acquire. RISC-V has not defined smp_cond_load_acquire, so it uses the generic function defined in include/asm-generic/barrier.h. The generic smp_cond_load_acquire calls smp_cond_load_relaxed, which loops calling READ_ONCE and cpu_relax. On this platform, READ_ONCE needs a barrier after it to observe the new value. Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
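For reference, the generic loop in question, lightly abridged from include/asm-generic/barrier.h; per the commit message, nothing between iterations forces this core to observe a fresh value, hence the hang:

  #define smp_cond_load_relaxed(ptr, cond_expr) ({        \
          typeof(ptr) __PTR = (ptr);                      \
          __unqual_scalar_typeof(*ptr) VAL;               \
          for (;;) {                                      \
                  VAL = READ_ONCE(*__PTR);                \
                  if (cond_expr)                          \
                          break;                          \
                  cpu_relax();                            \
          }                                               \
          (typeof(*ptr))VAL;                              \
  })

The generic smp_cond_load_acquire() only adds an acquire barrier after this loop completes, so it does not help a load stuck inside the loop.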
-
Xiaoguang Xing authored
Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-
Xiaoguang Xing authored
Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-
Xiaoguang Xing authored
Add FORCE_MAX_ZONEORDER to support custom max-order requirements. The default of 13 allows 2^(13-1) pages x 4KiB = 16MB allocations, for requesting large (16MB) contiguous memory. Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-
Xiaoguang Xing authored
Avoid a page fault when machine_kexec() calls the kexec method in the control_code_buffer. When PUD_SIZE is used as the mapping size, __set_memory() only updates init_mm, so other tasks take page faults when they use a PUD entry that was modified only in init_mm. Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-
Xiaoguang Xing authored
Signed-off-by: Xiaoguang Xing <xiaoguang.xing@sophgo.com>
-