  1. Mar 10, 2022
    • io_uring: speedup provided buffer handling · cc3cec83
      Jens Axboe authored

      In testing high frequency workloads with provided buffers, we spend a
      lot of time in allocating and freeing the buffer units themselves.
      Rather than repeatedly free and alloc them, add a recycling cache
      instead. There are two caches:
      
      - ctx->io_buffers_cache. This is the one we grab from in the submission
        path, and it's protected by ctx->uring_lock. For inline completions,
        we can recycle straight back to this cache and not need any extra
        locking.
      
      - ctx->io_buffers_comp. If we're not under uring_lock, then we use this
        list to recycle buffers. It's protected by the completion_lock.
      
      On adding a new buffer, check io_buffers_cache. If it's empty, check if
      we can splice entries from io_buffers_comp.

      This removes about 5-10% of the overhead from provided buffers, bringing
      it pretty close to the non-provided path.
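The two-cache scheme described above can be sketched as a small user-space model. This is only an illustration, not the kernel code: `struct buf`, `struct ring_ctx`, and the three helpers are hypothetical names, and the locks are indicated by comments rather than taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a provided-buffer unit. */
struct buf {
    struct buf *next;
};

struct ring_ctx {
    struct buf *cache;     /* models ctx->io_buffers_cache (uring_lock) */
    struct buf *comp;      /* models ctx->io_buffers_comp (completion_lock) */
    unsigned long allocs;  /* how often we had to fall back to calloc() */
};

/* Recycle from a path that already holds uring_lock (inline completion). */
static void buf_recycle_locked(struct ring_ctx *ctx, struct buf *b)
{
    assert(b != NULL);
    b->next = ctx->cache;
    ctx->cache = b;
}

/* Recycle from a path that only holds completion_lock. */
static void buf_recycle_comp(struct ring_ctx *ctx, struct buf *b)
{
    assert(b != NULL);
    b->next = ctx->comp;
    ctx->comp = b;
}

/* Submission path: try the cache, then splice from comp, then allocate. */
static struct buf *buf_get(struct ring_ctx *ctx)
{
    struct buf *b;

    if (!ctx->cache && ctx->comp) {
        /* The kernel would hold completion_lock around this splice. */
        ctx->cache = ctx->comp;
        ctx->comp = NULL;
    }
    if (ctx->cache) {
        b = ctx->cache;
        ctx->cache = b->next;
        b->next = NULL;
        return b;
    }
    b = calloc(1, sizeof(*b));
    if (b)
        ctx->allocs++;
    return b;
}
```

In this model the allocator is only reached when both lists are empty, which is the effect the commit is after.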
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for registering ring file descriptors · e7a6c00d
      Jens Axboe authored

      Lots of workloads use multiple threads, in which case the file table is
      shared between them. This makes getting and putting the ring file
      descriptor for each io_uring_enter(2) system call more expensive, as it
      involves an atomic get and put for each call.
      
      Similarly to how we allow registering normal file descriptors to avoid
      this overhead, add support for an io_uring_register(2) API that allows
      registering the ring fds themselves:

      1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and registers them with the task.
      2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and unregisters them.
      
      When a ring fd is registered, it is internally represented by an offset.
      This offset is returned to the application, and the application then
      uses this offset and sets IORING_ENTER_REGISTERED_RING for the
      io_uring_enter(2) system call. This works just like using a registered
      file descriptor, rather than a real one, in an SQE, where
      IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
      offset/descriptor rather than a real file descriptor.
      
      In initial testing, this provides a nice bump in performance for
      threaded applications in real world cases where the batch count (i.e.
      the number of requests submitted per io_uring_enter(2) invocation) is low.
      In a microbenchmark, submitting NOP requests, we see the following
      increases in performance:
      
      Requests per syscall	Baseline	Registered	Increase
      ----------------------------------------------------------------
      1			 ~7030K		 ~8080K		+15%
      2			~13120K		~14800K		+13%
      4			~22740K		~25300K		+11%
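The offset indirection described in this commit can be modelled in user space as a small per-task table. This is a sketch of the idea only: `struct task_rings`, `MAX_RING_FDS`, and the three helpers are hypothetical and do not call into the kernel; the `registered` argument stands in for IORING_ENTER_REGISTERED_RING.

```c
#include <assert.h>
#include <errno.h>

#define MAX_RING_FDS 16   /* hypothetical table size for this model */

struct task_rings {
    int fds[MAX_RING_FDS];   /* 0 means "slot free", otherwise fd + 1 */
};

/* Register a ring fd; returns its slot offset, or -EBUSY if the table is full. */
static int ring_fd_register(struct task_rings *t, int fd)
{
    assert(fd >= 0);
    for (int i = 0; i < MAX_RING_FDS; i++) {
        if (!t->fds[i]) {
            t->fds[i] = fd + 1;
            return i;
        }
    }
    return -EBUSY;
}

/* Unregister by offset; returns 0, or -EINVAL for a bad or empty slot. */
static int ring_fd_unregister(struct task_rings *t, int offset)
{
    if (offset < 0 || offset >= MAX_RING_FDS || !t->fds[offset])
        return -EINVAL;
    t->fds[offset] = 0;
    return 0;
}

/*
 * Resolve the first argument of an enter call. With `registered` set the
 * value is an offset into the table; otherwise it is a real descriptor
 * and is passed through unchanged.
 */
static int ring_fd_resolve(const struct task_rings *t, int fd_or_offset,
                           int registered)
{
    if (!registered)
        return fd_or_offset;
    if (fd_or_offset < 0 || fd_or_offset >= MAX_RING_FDS ||
        !t->fds[fd_or_offset])
        return -EINVAL;
    return t->fds[fd_or_offset] - 1;
}
```

The saving claimed in the commit comes from the registered path being a plain table lookup rather than an atomic get and put on a shared file table.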
      
      Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: documentation fixup · 63c36549
      Dylan Yudaken authored

      Fix incorrect name reference in comment. ki_filp does not exist in the
      struct, but file does.

      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: do not recalculate ppos unnecessarily · b4aec400
      Dylan Yudaken authored

      There is a slight optimisation to be had by calculating the correct pos
      pointer inside io_kiocb_update_pos and then using that later.
      
      It seems code size drops by a bit:
      000000000000a1b0 0000000000000400 t io_read
      000000000000a5b0 0000000000000319 t io_write
      
      vs
      000000000000a1b0 00000000000003f6 t io_read
      000000000000a5b0 0000000000000310 t io_write
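The second column of this output is the symbol size in hexadecimal, as printed by `nm -S`. A small helper, hypothetical and for illustration only, turns the two listings above into byte savings.

```c
#include <assert.h>
#include <stdlib.h>

/* Bytes saved between two hexadecimal size columns taken from `nm -S`. */
static long nm_size_delta(const char *before_hex, const char *after_hex)
{
    assert(before_hex && after_hex);
    unsigned long before = strtoul(before_hex, NULL, 16);
    unsigned long after = strtoul(after_hex, NULL, 16);

    return (long)before - (long)after;
}
```

Applied to the figures above, io_read shrinks from 0x400 to 0x3f6 (10 bytes) and io_write from 0x319 to 0x310 (9 bytes).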
      
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: update kiocb->ki_pos at execution time · d34e1e5b
      Dylan Yudaken authored

      Update kiocb->ki_pos at execution time rather than in io_prep_rw().
      io_prep_rw() happens before the job is enqueued to a worker and so the
      offset might be read multiple times before being executed once.
      
      This ensures that the file position in a set of _linked_ SQEs is only
      obtained after earlier SQEs have completed, and so includes their
      incremented file position.
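The difference between sampling the position at prep time and at execution time can be shown with a toy model of two linked reads that both ask for "the current file position" (offset -1). All names here are hypothetical; the model only tracks positions and does no I/O.

```c
#include <assert.h>

struct file_model { long pos; };   /* stands in for the file position */

struct req_model {
    long offset;     /* -1 means "use the current file position" */
    long len;
    long used_pos;   /* position the request ended up using */
};

/* Sample the position for a request from the file's current state. */
static void prep_sample(struct req_model *r, const struct file_model *f)
{
    r->used_pos = (r->offset == -1) ? f->pos : r->offset;
}

/* Execute a request; optionally sample the position right now. */
static void exec_req(struct req_model *r, struct file_model *f, int sample_now)
{
    if (sample_now)
        prep_sample(r, f);
    if (r->offset == -1)
        f->pos = r->used_pos + r->len;   /* advance the file position */
}

/* Run two linked reads of len bytes; return where the second one read. */
static long second_read_pos(long len, int sample_at_exec)
{
    struct file_model f = { .pos = 0 };
    struct req_model a = { .offset = -1, .len = len, .used_pos = 0 };
    struct req_model b = a;

    assert(len > 0);
    if (!sample_at_exec) {   /* old behaviour: both prepped up front */
        prep_sample(&a, &f);
        prep_sample(&b, &f);
    }
    exec_req(&a, &f, sample_at_exec);
    exec_req(&b, &f, sample_at_exec);
    return b.used_pos;
}
```

With prep-time sampling the second read in the chain reuses position 0; with execution-time sampling it starts where the first read ended.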
      
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove duplicated calls to io_kiocb_ppos · af9c45ec
      Dylan Yudaken authored

      io_kiocb_ppos is called in both branches, and it seems that the compiler
      does not fuse this. Fusing removes a few bytes from loop_rw_iter.
      
      Before:
      $ nm -S fs/io_uring.o | grep loop_rw_iter
      0000000000002430 0000000000000124 t loop_rw_iter
      
      After:
      $ nm -S fs/io_uring.o | grep loop_rw_iter
      0000000000002430 000000000000010d t loop_rw_iter
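The size difference here is 0x124 - 0x10d = 23 bytes. The kind of change involved, hoisting a helper call that appears in both branches into a single call site, can be sketched as follows. This is a hypothetical model, not the actual loop_rw_iter code: `model_ppos` stands in for io_kiocb_ppos, and the read and write helpers only advance a position.

```c
#include <assert.h>
#include <stddef.h>

struct kiocb_model { long pos; int no_pos; };

/* Stand-in for the helper: no position pointer for stream-like files. */
static long *model_ppos(struct kiocb_model *k)
{
    return k->no_pos ? NULL : &k->pos;
}

static long model_read(long n, long *ppos)
{
    assert(n >= 0);
    if (ppos)
        *ppos += n;
    return n;
}

static long model_write(long n, long *ppos)
{
    assert(n >= 0);
    if (ppos)
        *ppos += n;
    return n;
}

/* Before: the helper call is spelled out in both branches. */
static long xfer_dup(struct kiocb_model *k, int is_write, long n)
{
    if (is_write)
        return model_write(n, model_ppos(k));
    return model_read(n, model_ppos(k));
}

/* After: one call site, result shared by both branches. */
static long xfer_fused(struct kiocb_model *k, int is_write, long n)
{
    long *ppos = model_ppos(k);

    return is_write ? model_write(n, ppos) : model_read(n, ppos);
}
```

Both variants behave the same; the fused one simply gives the compiler a single call to emit.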
      
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Remove unneeded test in io_run_task_work_sig() · c5020bc8
      Olivier Langlois authored

      Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending()
      directly from io_run_task_work_sig().

      Signed-off-by: Olivier Langlois <olivier@trillion01.com>
      Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-uring: Make tracepoints consistent. · 502c87d6
      Stefan Roesch authored

      This makes the io-uring tracepoints consistent. Where it makes sense
      the tracepoints start with the following four fields:
      - context (ring)
      - request
      - user_data
      - opcode.

      Signed-off-by: Stefan Roesch <shr@fb.com>
      Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-uring: add __fill_cqe function · d5ec1dfa
      Stefan Roesch authored

      This introduces the __fill_cqe function. This is necessary
      to correctly issue the io_uring_complete tracepoint.

      Signed-off-by: Stefan Roesch <shr@fb.com>
      Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: use IO_WQ_ACCT_NR rather than hardcoded number · 86127bb1
      Hao Xu authored

      It's better to size the array with the defined enum value than with a
      hardcoded number.
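A minimal sketch of the pattern, assuming an enum whose last entry counts the accounting classes. `struct acct_model`, `struct wqe_model`, and `total_workers` are hypothetical names used only for this illustration.

```c
#include <assert.h>

enum {
    IO_WQ_ACCT_BOUND,
    IO_WQ_ACCT_UNBOUND,
    IO_WQ_ACCT_NR,   /* always last: number of accounting classes */
};

struct acct_model {
    int nr_workers;
    int max_workers;
};

struct wqe_model {
    /* Sized by the enum, so adding a class resizes the array too. */
    struct acct_model acct[IO_WQ_ACCT_NR];
};

/* Sum the workers across every accounting class. */
static int total_workers(const struct wqe_model *wqe)
{
    int sum = 0;

    assert(wqe != 0);
    for (int i = 0; i < IO_WQ_ACCT_NR; i++)
        sum += wqe->acct[i].nr_workers;
    return sum;
}
```

Loops and array declarations then stay in step with the enum without any literal `2` scattered through the code.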
      
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: reduce acct->lock crossing functions lock/unlock · e13fb1fe
      Hao Xu authored

      Reduce the cases where acct->lock is taken in one function and released
      in another, to make the code clearer.

      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: decouple work_list protection from the big wqe->lock · 42abc95f
      Hao Xu authored

      wqe->lock is abused: it now protects acct->work_list, hash stuff,
      nr_workers, wqe->free_list and so on. Let's first get the work_list out
      of the wqe->lock mess by introducing a specific lock for the work list.
      This is the first step towards solving the heavy contention between work
      insertion and work consumption.
      Benefits:
        - split locking for the bound and unbound work lists
        - reduced contention between work_list visits and the (worker's)
          free_list.

      For the hash stuff, since there won't be work with the same file in both
      the bound and unbound work lists, they won't visit the same hash entry,
      so it works well to use the new lock to protect the hash stuff.
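The lock split described above can be sketched in user space with one small spinlock per work list. This is a model under stated assumptions, not the io-wq code: the `model_` types and helpers are hypothetical, the spinlock is a plain C11 `atomic_flag`, and the list is a simple LIFO.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_spin { atomic_flag f; };

static void model_spin_lock(struct model_spin *s)
{
    while (atomic_flag_test_and_set_explicit(&s->f, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void model_spin_unlock(struct model_spin *s)
{
    atomic_flag_clear_explicit(&s->f, memory_order_release);
}

struct model_work { struct model_work *next; };

struct model_acct {
    struct model_spin lock;        /* new: protects only this work_list */
    struct model_work *work_list;
};

struct model_wqe {
    struct model_spin lock;        /* still covers free_list, nr_workers, ... */
    struct model_acct acct[2];     /* [0] bound, [1] unbound */
};

static void model_wqe_init(struct model_wqe *wqe)
{
    atomic_flag_clear(&wqe->lock.f);
    for (int i = 0; i < 2; i++) {
        atomic_flag_clear(&wqe->acct[i].lock.f);
        wqe->acct[i].work_list = NULL;
    }
}

/* Insertion only takes the lock of the list it touches. */
static void model_enqueue(struct model_wqe *wqe, int idx, struct model_work *w)
{
    struct model_acct *acct = &wqe->acct[idx];

    assert(idx == 0 || idx == 1);
    model_spin_lock(&acct->lock);
    w->next = acct->work_list;
    acct->work_list = w;
    model_spin_unlock(&acct->lock);
}

/* Consumption likewise leaves the big wqe lock alone. */
static struct model_work *model_dequeue(struct model_wqe *wqe, int idx)
{
    struct model_acct *acct = &wqe->acct[idx];
    struct model_work *w;

    assert(idx == 0 || idx == 1);
    model_spin_lock(&acct->lock);
    w = acct->work_list;
    if (w)
        acct->work_list = w->next;
    model_spin_unlock(&acct->lock);
    return w;
}
```

Bound and unbound insertions no longer serialize against each other, nor against code that only needs the big lock for the free list.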
      
      Results:
      set max_unbound_worker = 4, test with echo-server:
      nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
      (-n connection, -l workload)
      before this patch:
      Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
      Overhead  Command          Shared Object         Symbol
        28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
         8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
         6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
         2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
         2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
         2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
         1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
         1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
         1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
         1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
         0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked
      
      with this patch:
      Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
      Overhead  Command          Shared Object         Symbol
        16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
         9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
         4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
         2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
         2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
         2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
         1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
         1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
         1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up
      
      spin_lock failure from 25.59% + 8.89% =  34.48% to 16.86% + 4.53% = 21.39%
      TPS is similar, while CPU usage drops from almost 400% to 350%.
      
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix use of uninitialized ret in io_eventfd_register() · f0a4e62b
      Nathan Chancellor authored

      Clang warns:
      
        fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
                return ret;
                       ^~~
        fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
                int fd, ret;
                           ^
                            = 0
        1 warning generated.
      
      Just return 0 directly and reduce the scope of ret to the if statement,
      as that is the only place that it is used, which is how the function was
      before the fixes commit.
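The shape of this fix can be sketched with a stand-alone model: the error variable lives only in the branch that assigns it, and the success path returns a literal 0. `model_fdget` and `model_register` are hypothetical stand-ins, not the kernel functions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lookup: returns a handle (>= 0) or a negative errno. */
static int model_fdget(int fd)
{
    return fd >= 0 ? fd : -EBADF;
}

static int model_register(int fd, int *slot)
{
    int handle = model_fdget(fd);

    assert(slot != 0);
    if (handle < 0) {
        int ret = handle;   /* scoped to the only branch that uses it */

        return ret;
    }
    *slot = handle;
    return 0;               /* no function-wide 'ret' left to forget */
}
```

With no function-scope `ret`, there is nothing left for -Wuninitialized to flag on the success path.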
      
      Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
      Link: https://github.com/ClangBuiltLinux/linux/issues/1579
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove ring quiesce for io_uring_register · 8bb649ee
      Usama Arif authored

      None of the opcodes in io_uring_register use ring quiesce anymore. Hence
      io_register_op_must_quiesce always returns false and io_ctx_quiesce is
      never called.

      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce while registering restrictions and enabling rings · ff16cfcf
      Usama Arif authored

      IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
      no requests until IORING_REGISTER_ENABLE_RINGS is called. And
      IORING_REGISTER_RESTRICTIONS works only before
      IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
      for these opcodes.

      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce while registering async eventfd · c75312dd
      Usama Arif authored

      This is done using the RCU data structure (io_ev_fd). eventfd_async is
      moved from io_ring_ctx to io_ev_fd, which is RCU protected, hence avoiding
      ring quiesce, which is much more expensive than an RCU lock. The place
      where eventfd_async is read is already under rcu_read_lock, so there is no
      extra RCU read-side critical section needed.

      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce while registering/unregistering eventfd · 77bc59b4
      Usama Arif authored

      This is done by creating a new RCU data structure (io_ev_fd) as part of
      io_ring_ctx that holds the eventfd_ctx.
      
      The function io_eventfd_signal is executed under rcu_read_lock with a
      single rcu_dereference to io_ev_fd so that if another thread unregisters
      the eventfd while io_eventfd_signal is still being executed, the
      eventfd_signal for which io_eventfd_signal was called completes
      successfully.
      
      The process of registering/unregistering eventfd is already done under
      uring_lock so multiple threads won't enter a race condition while
      registering/unregistering eventfd.
      
      With the above approach ring quiesce can be avoided, which is much more
      expensive than using an RCU lock. On the system tested, io_uring_register
      with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
      to 15ms before with ring quiesce.
      
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
      [axboe: long line fixups]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove trace for eventfd · 2757be22
      Usama Arif authored

      The information on whether eventfd is registered is not very useful and
      would result in the tracepoint being enclosed in an rcu_read_lock section
      in a later patch that tries to avoid ring quiesce for registering eventfd.

      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Mar 07, 2022
    • Linux 5.17-rc7 · ffb217a1
      Linus Torvalds authored
      v5.17-rc7
    • Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 3ee65c0f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A few more fixes for various problems that have user visible effects
        or seem to be urgent:
      
         - fix corruption when combining DIO and non-blocking io_uring over
           multiple extents (seen on MariaDB)
      
         - fix relocation crash due to premature return from commit
      
         - fix quota deadlock between rescan and qgroup removal
      
         - fix item data bounds checks in tree-checker (found on a fuzzed
           image)
      
         - fix fsync of prealloc extents after EOF
      
         - add missing run of delayed items after unlink during log replay
      
         - don't start relocation until snapshot drop is finished
      
         - fix reversed condition for subpage writers locking
      
         - fix warning on page error"
      
      * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fallback to blocking mode when doing async dio over multiple extents
        btrfs: add missing run of delayed items after unlink during log replay
        btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
        btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
        btrfs: do not start relocation until in progress drops are done
        btrfs: tree-checker: use u64 for item data end to avoid overflow
        btrfs: do not WARN_ON() if we have PageError set
        btrfs: fix lost prealloc extents beyond eof after full fsync
        btrfs: subpage: fix a wrong check on subpage->writers
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f81664f7
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 guest:
      
         - Tweaks to the paravirtualization code, to avoid using them when
           they're pointless or harmful
      
        x86 host:
      
         - Fix for SRCU lockdep splat
      
         - Brown paper bag fix for the propagation of errno"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
        KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
        KVM: x86: Yield to IPI target vCPU only if it is busy
        x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
        x86/kvm: Don't waste memory if kvmclock is disabled
        x86/kvm: Don't use PV TLB/yield when mwait is advertised
    • Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9bdeaca1
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
      
        Thanks to Murilo Opsfelder Araujo, and Erhard F"
      
      * tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
    • Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · f40a33f5
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix sorting on old "cpu" value in histograms
      
       - Fix return value of __setup() boot parameter handlers
      
      * tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Fix return value of __setup handlers
        tracing/histogram: Fix sorting on old "cpu" value
  3. Mar 06, 2022
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · dcde98da
      Linus Torvalds authored
      Pull input updates from Dmitry Torokhov:
      
       - a fixup for Goodix touchscreen driver allowing it to work on certain
         Cherry Trail devices
      
       - a fix for imbalanced enable/disable regulator in Elan touchpad driver
         that became apparent when used with Asus TF103C 2-in-1 dock
      
       - a couple new input keycodes used on newer keyboards
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        HID: add mapping for KEY_ALL_APPLICATIONS
        HID: add mapping for KEY_DICTATE
        Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
        Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
        Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
        Input: goodix - use the new soc_intel_is_byt() helper
        Input: samsung-keypad - properly state IOMEM dependency
    • Merge branch 'akpm' (patches from Andrew) · 0014404f
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "8 patches.
      
        Subsystems affected by this patch series: mm (hugetlb, pagemap, and
        userfaultfd), memfd, selftests, and kconfig"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        configs/debug: set CONFIG_DEBUG_INFO=y properly
        proc: fix documentation and description of pagemap
        kselftest/vm: fix tests build with old libc
        memfd: fix F_SEAL_WRITE after shmem huge page allocated
        mm: fix use-after-free when anon vma name is used after vma is freed
        mm: prevent vm_area_struct::anon_name refcount saturation
        mm: refactor vm_area_struct::anon_vma_name usage code
        selftests/vm: cleanup hugetlb file after mremap test
    • Merge tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · f9026e19
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
         switching between ftrace_caller/ftrace_regs_caller and supplying
         pt_regs only when ftrace_regs_caller is activated.
      
       - Fix exception table sorting.
      
       - Fix breakage of kdump tooling by preserving metadata it cannot
         function without.
      
      * tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/extable: fix exception table sorting
        s390/ftrace: fix arch_ftrace_get_regs implementation
        s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
        s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
    • configs/debug: set CONFIG_DEBUG_INFO=y properly · d1eff16d
      Qian Cai authored

      CONFIG_DEBUG_INFO can't be set by user directly, so set
      CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead.
      
      Otherwise, we end up with no debuginfo in vmlinux which is a big no-no
      for kernel debugging.
      
      Link: https://lkml.kernel.org/r/20220301202920.18488-1-quic_qiancai@quicinc.com
      Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: fix documentation and description of pagemap · dd21bfa4
      Yun Zhou authored

      Bit 57 has been exported for uffd-wp write-protected since commit
      fb8e37f3 ("mm/pagemap: export uffd-wp protection information"), so
      update the documentation and description to match and avoid unnecessary
      confusion.

      Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
      Fixes: fb8e37f3 ("mm/pagemap: export uffd-wp protection information")
      Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
      Cc: Florian Schmidt <florian.schmidt@nutanix.com>
      Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kselftest/vm: fix tests build with old libc · b773827e
      Chengming Zhou authored

      The error message when I build vm tests on debian10 (GLIBC 2.28):
      
          userfaultfd.c: In function `userfaultfd_pagemap_test':
          userfaultfd.c:1393:37: error: `MADV_PAGEOUT' undeclared (first use
          in this function); did you mean `MADV_RANDOM'?
            if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
                                               ^~~~~~~~~~~~
                                               MADV_RANDOM
      
      This patch includes these newer definitions from UAPI linux/mman.h, which
      fixes the tests build on systems without these definitions in glibc
      sys/mman.h.
      
      Link: https://lkml.kernel.org/r/20220227055330.43087-2-zhouchengming@bytedance.com
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memfd: fix F_SEAL_WRITE after shmem huge page allocated · f2b277c4
      Hugh Dickins authored

      Wangyong reports: after enabling tmpfs filesystem to support transparent
      hugepage with the following command:
      
        echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
      
      the docker program tries to add F_SEAL_WRITE through the following
      command, but it fails unexpectedly with errno EBUSY:
      
        fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
      
      That is because memfd_tag_pins() and memfd_wait_for_pins() were never
      updated for shmem huge pages: checking page_mapcount() against
      page_count() is hopeless on THP subpages - they need to check
      total_mapcount() against page_count() on THP heads only.
      
      Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
      (compared != 1): either can be justified, but given the non-atomic
      total_mapcount() calculation, it is better now to be strict.  Bear in
      mind that total_mapcount() itself scans all of the THP subpages, when
      choosing to take an XA_CHECK_SCHED latency break.
      
      Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
      page has been swapped out since memfd_tag_pins(), then its refcount must
      have fallen, and so it can safely be untagged.
      
      Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Reported-by: wangyong <wang.yong12@zte.com.cn>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: CGEL ZTE <cgel.zte@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix use-after-free when anon vma name is used after vma is freed · 942341dc
      Suren Baghdasaryan authored

      When adjacent vmas are being merged it can result in the vma that was
      originally passed to madvise_update_vma being destroyed.  In the current
      implementation, the name parameter passed to madvise_update_vma points
      directly to vma->anon_name and it is used after the call to vma_merge.
      In the cases when vma_merge merges the original vma and destroys it,
      this might result in UAF.  For that the original vma would have to hold
      the anon_vma_name with the last reference.  The following vma would need
      to contain a different anon_vma_name object with the same string.  Such
      scenario is shown below:
      
      madvise_vma_behavior(vma)
        madvise_update_vma(vma, ..., anon_name == vma->anon_name)
          vma_merge(vma)
            __vma_adjust(vma) <-- merges vma with adjacent one
              vm_area_free(vma) <-- frees the original vma
          replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name
      
      Fix this by raising the name refcount and stabilizing it.
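The idea of pinning the name before an operation that may free its owner can be sketched with a plain reference count. This is a user-space model with hypothetical names (`name_model`, `vma_model`, and the helpers), not the mm code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct name_model { int refs; char text[16]; };
struct vma_model { struct name_model *name; };

static struct name_model *name_get(struct name_model *n)
{
    if (n)
        n->refs++;
    return n;
}

static void name_put(struct name_model *n)
{
    if (n && --n->refs == 0)
        free(n);
}

/* Create a vma holding the only reference to a freshly allocated name. */
static struct vma_model *vma_new(const char *text)
{
    struct vma_model *vma = malloc(sizeof(*vma));
    struct name_model *name = malloc(sizeof(*name));

    if (!vma || !name) {
        free(vma);
        free(name);
        return NULL;
    }
    name->refs = 1;
    snprintf(name->text, sizeof(name->text), "%s", text);
    vma->name = name;
    return vma;
}

/* Models a merge that destroys the vma and drops its name reference. */
static void model_merge_and_free(struct vma_model *vma)
{
    name_put(vma->name);
    free(vma);
}

/* Read the name after the merge, safely, by pinning it first. */
static size_t update_vma_stable(struct vma_model *vma)
{
    struct name_model *name = name_get(vma->name);   /* pin before merge */
    size_t len;

    assert(name != NULL);
    model_merge_and_free(vma);    /* vma and its own reference are gone */
    len = strlen(name->text);     /* still valid: we hold a reference */
    name_put(name);               /* last reference, name is freed here */
    return len;
}
```

Without the `name_get()` call, the `strlen()` would read memory that the merge had already freed, which is the use-after-free in the call chain above.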
      
      Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
      Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
      Fixes: 9a10064f ("mm: add a field to store names for private anonymous memory")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: <syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      942341dc
    • Suren Baghdasaryan's avatar
      mm: prevent vm_area_struct::anon_name refcount saturation · 96403e11
      Suren Baghdasaryan authored
      
      
      The anon_vma_name refcount in a deep process chain with many vmas
      could grow really high.  With default sysctl_max_map_count (64k) and
      default pid_max (32k) the max number of vmas in the system is
      2147450880 and the refcounter has headroom of 1073774592 before it
      reaches REFCOUNT_SATURATED (3221225472).
      
      Therefore it's unlikely that an anonymous name refcounter will overflow
      with these defaults.  Currently the max for pid_max is PID_MAX_LIMIT
      (4194304) and for sysctl_max_map_count it's INT_MAX (2147483647).  In
      this configuration anon_vma_name refcount overflow becomes theoretically
      possible (which would still require heavy sharing of that anon_vma_name
      between processes).
      
      The kref refcounting interface used in the anon_vma_name structure will
      detect a counter overflow when it reaches the REFCOUNT_SATURATED value,
      but will only generate a warning and freeze the ref counter.  This would
      lead to the refcounted object never being freed.  A determined attacker
      could leak memory like that, but it would be a rather expensive and
      inefficient way to do so.
      
      To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
      sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
      leaves INT_MAX/2 (1073741823) values before the counter reaches
      REFCOUNT_SATURATED.  This should provide enough headroom for raising the
      refcounts temporarily.
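
The figures quoted above can be checked with a few lines of C. This is a sketch only: the SKETCH_ constants copy the values given in the message, the function names are hypothetical, and 65535 and 32768 are assumed here as the concrete values behind "64k" and "32k" because they reproduce the quoted product. sketch_can_share() models the "stop sharing at REFCOUNT_MAX" rule as a plain comparison and is not the kernel helper.

```c
#include <stdint.h>

/* Values as quoted in the commit message. */
#define SKETCH_REFCOUNT_MAX        2147483647ULL  /* INT_MAX */
#define SKETCH_REFCOUNT_SATURATED  3221225472ULL

/* Upper bound on vmas: every process at its map-count limit. */
static uint64_t sketch_max_vmas(uint64_t max_map_count, uint64_t pid_max)
{
    return max_map_count * pid_max;
}

/* References that can still be taken before the counter saturates. */
static uint64_t sketch_headroom(uint64_t refcount)
{
    return SKETCH_REFCOUNT_SATURATED - refcount;
}

/* Sharing is allowed only while the count is below REFCOUNT_MAX. */
static int sketch_can_share(uint64_t refcount)
{
    return refcount < SKETCH_REFCOUNT_MAX;
}
```

With the assumed defaults, sketch_max_vmas(65535, 32768) gives 2147450880 and sketch_headroom() of that value gives 1073774592, matching the message. Once a name reaches REFCOUNT_MAX, sketch_can_share() returns 0 and a caller would allocate a fresh copy instead, leaving roughly INT_MAX/2 increments of slack for temporary references.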
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Suren Baghdasaryan's avatar
      mm: refactor vm_area_struct::anon_vma_name usage code · 5c26f6ac
      Suren Baghdasaryan authored
      
      
      Avoid mixing strings and their anon_vma_name referenced pointers by
      using struct anon_vma_name whenever possible.  This simplifies the code
      and allows easier sharing of anon_vma_name structures when they
      represent the same name.
      
      [surenb@google.com: fix comment]
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mike Kravetz's avatar
      selftests/vm: cleanup hugetlb file after mremap test · ff712a62
      Mike Kravetz authored
      
      
      The hugepage-mremap test will create a file in a hugetlb filesystem.  In
      a default 'run_vmtests' run, the file will contain all the hugetlb
      pages.  After the test, the file remains and there are no free hugetlb
      pages for subsequent tests.  This causes those hugetlb tests to fail.
      
      Change hugepage-mremap to take the name of the hugetlb file as an
      argument.  Unlink the file within the test, and just to be sure remove
      the file in the run_vmtests script.
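
The unlink-inside-the-test approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the selftest's code: open_and_unlink() is a hypothetical helper, and an ordinary file path stands in for a file on a hugetlb filesystem. It relies on the POSIX behaviour that an unlinked file stays usable through an open descriptor and is released when the last descriptor is closed.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens (creating if needed) the file at path and unlinks it right away.
 * Returns the open descriptor, or -1 on failure. The name disappears
 * from the filesystem, but the data stays reachable through the
 * descriptor until it is closed.
 */
static int open_and_unlink(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because the name is gone as soon as the helper returns, the file cannot outlive the test process even if the test exits early, so its pages are returned for later tests.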
      
      Link: https://lkml.kernel.org/r/20220201033459.156944-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Mar 05, 2022
    • Murilo Opsfelder Araujo's avatar
      powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set · 58dbe9b3
      Murilo Opsfelder Araujo authored
      The following build failure occurs when CONFIG_PPC_64S_HASH_MMU is not
      set:
      
          arch/powerpc/kernel/setup_64.c: In function ‘setup_per_cpu_areas’:
          arch/powerpc/kernel/setup_64.c:811:21: error: ‘mmu_linear_psize’ undeclared (first use in this function); did you mean ‘mmu_virtual_psize’?
            811 |                 if (mmu_linear_psize == MMU_PAGE_4K)
                |                     ^~~~~~~~~~~~~~~~
                |                     mmu_virtual_psize
          arch/powerpc/kernel/setup_64.c:811:21: note: each undeclared identifier is reported only once for each function it appears in
      
      Move the declaration of mmu_linear_psize outside of
      CONFIG_PPC_64S_HASH_MMU ifdef.
      
      After the above is fixed, it fails later with the following error:
      
          ld: arch/powerpc/kexec/file_load_64.o: in function `.arch_kexec_kernel_image_probe':
          file_load_64.c:(.text+0x1c1c): undefined reference to `.add_htab_mem_range'
      
      Fix that, too, by making the add_htab_mem_range() symbol conditional on
      CONFIG_PPC_64S_HASH_MMU.
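
One common way to make a symbol conditional on a config option without touching its callers is a static inline stub for the disabled case. The sketch below shows that general pattern only; it is not taken from the patch. SKETCH_HASH_MMU is a hypothetical stand-in for CONFIG_PPC_64S_HASH_MMU, and add_range_sketch() and probe_sketch() are made-up names.

```c
/* SKETCH_HASH_MMU is a hypothetical stand-in for CONFIG_PPC_64S_HASH_MMU. */
#ifdef SKETCH_HASH_MMU
/* Option enabled: the real implementation records one more range. */
static int add_range_sketch(int *nr_ranges)
{
    (*nr_ranges)++;
    return 0;
}
#else
/* Option disabled: callers still compile and link, but nothing is added. */
static inline int add_range_sketch(int *nr_ranges)
{
    (void)nr_ranges;
    return 0;
}
#endif

/* The caller needs no #ifdef of its own. */
static int probe_sketch(int *nr_ranges)
{
    return add_range_sketch(nr_ranges);
}
```

Built without SKETCH_HASH_MMU, probe_sketch() returns 0 and leaves the count untouched, so no undefined reference remains at link time.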
      
      Fixes: 387e220a ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")
      Reported-by: Erhard F. <erhard_f@mailbox.org>
      Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215567
      Link: https://lore.kernel.org/r/20220301204743.45133-1-muriloo@linux.ibm.com
    • Linus Torvalds's avatar
      Merge tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block · ac84e82f
      Linus Torvalds authored
      Pull block fix from Jens Axboe:
       "Just a small UAF fix for blktrace"
      
      * tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block:
        blktrace: fix use after free for struct blk_trace
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 07ebd38a
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - Fixes for a handful of KASAN-related crashes.
      
       - A fix to avoid a crash during boot for SPARSEMEM &&
         !SPARSEMEM_VMEMMAP configurations.
      
       - A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
      
       - A fix for the K210's device tree to properly populate the interrupt
         map, so hart1 will get interrupts again.
      
      * tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: dts: k210: fix broken IRQs on hart1
        riscv: Fix kasan pud population
        riscv: Move high_memory initialization to setup_bootmem
        riscv: Fix config KASAN && DEBUG_VIRTUAL
        riscv: Fix DEBUG_VIRTUAL false warnings
        riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
        riscv: Fix is_linear_mapping with recent move of KASAN region
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 3f509f59
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix a double list_add() in Intel VT-d code
      
       - Add missing put_device() in Tegra SMMU driver
      
       - Two AMD IOMMU fixes:
           - Memory leak in IO page-table freeing code
           - Add missing recovery from event-log overflow
      
      * tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
        iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
        iommu/amd: Fix I/O page table memory leak
        iommu/amd: Recover from event log overflow
    • Linus Torvalds's avatar
      Merge tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · a4ffdb61
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix NULL pointer dereference in the thermal netlink interface (Nicolas
        Cavallari)"
      
      * tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
    • Linus Torvalds's avatar
      Merge tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8d670948
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Hopefully the last PR for 5.17, including just a few small changes:
        an additional fix for ASoC ops boundary check and other minor
        device-specific fixes"
      
      * tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: intel_hdmi: Fix reference to PCM buffer address
        ASoC: cs4265: Fix the duplicated control name
        ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min