  1. Nov 16, 2021
    • Alexander Lobakin's avatar
      samples/bpf: Fix build error due to -isystem removal · 6060a6cb
      Alexander Lobakin authored
      Since recent Kbuild updates we no longer include files from compiler
      directories. However, samples/bpf/hbm_kern.h hasn't been tuned for
      this (LLVM 13):
      
        CLANG-bpf  samples/bpf/hbm_out_kern.o
      In file included from samples/bpf/hbm_out_kern.c:55:
      samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found
               ^~~~~~~~~~
      1 error generated.
        CLANG-bpf  samples/bpf/hbm_edt_kern.o
      In file included from samples/bpf/hbm_edt_kern.c:53:
      samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found
               ^~~~~~~~~~
      1 error generated.
      
      It is enough to just drop both stdbool.h and stddef.h from includes
      to fix those.
      
      Fixes: 04e85bbf ("isystem: delete global -isystem compile option")
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Link: https://lore.kernel.org/bpf/20211115130741.3584-1-alexandr.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6060a6cb
    • Alexei Starovoitov's avatar
      Merge branch 'Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs' · 9e4dc892
      Alexei Starovoitov authored
      
      
      Dmitrii Banshchikov says:
      
      ====================
      
      Various locking issues are possible with bpf_ktime_get_coarse_ns() and
      bpf_timer_* set of helpers.
      
      syzbot found a locking issue with bpf_ktime_get_coarse_ns() helper executed in
      BPF_PROG_TYPE_PERF_EVENT prog type - [1]. The issue is possible because the
      helper uses non fast version of time accessor that isn't safe for any context.
      The helper was added because it provided performance benefits in comparison to
      bpf_ktime_get_ns() helper.
      
      A similar locking issue is possible with bpf_timer_* set of helpers when used
      in tracing progs.
      
      The solution is to restrict use of the helpers in tracing progs.
      
      In the discussion at [1] it was stated that bpf_spin_lock related helpers
      should also be excluded for tracing progs. The verifier already performs a
      compatibility check between a map and a program: if a tracing program tries
      to use a map whose value contains struct bpf_spin_lock, verification fails.
      That is why bpf_spin_lock is already restricted.
      
      Patch 1 restricts helpers
      Patch 2 adds tests
      
      v1 -> v2:
       * Limit the helpers via func proto getters instead of allowed callback
       * Add note about helpers' restrictions to linux/bpf.h
       * Add Fixes tag
       * Remove extra \0 from btf_str_sec
       * Beside asm tests add prog tests
       * Trim CC
      
      1. https://lore.kernel.org/all/00000000000013aebd05cff8e064@google.com/
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9e4dc892
    • Dmitrii Banshchikov's avatar
      selftests/bpf: Add tests for restricted helpers · e60e6962
      Dmitrii Banshchikov authored
      
      
      This patch adds tests that bpf_ktime_get_coarse_ns(), bpf_timer_* and
      bpf_spin_lock()/bpf_spin_unlock() helpers are forbidden in tracing progs
      as their use there may result in various locking issues.
      
      Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211113142227.566439-3-me@ubique.spb.ru
      e60e6962
    • Dmitrii Banshchikov's avatar
      bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs · 5e0bc308
      Dmitrii Banshchikov authored
      Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
      progs may result in locking issues.
      
      bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
      isn't safe for any context:
      ======================================================
      WARNING: possible circular locking dependency detected
      5.15.0-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.4/14877 is trying to acquire lock:
      ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
      
      but task is already holding lock:
      ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
             _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
             __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
             debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
             debug_init kernel/time/hrtimer.c:468 [inline]
             hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
             ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
             ntp_init+0xa1/0xad kernel/time/ntp.c:1095
             timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
             start_kernel+0x267/0x56e init/main.c:1030
             secondary_startup_64_no_verify+0xb1/0xbb
      
      -> #0 (tk_core.seq.seqcount){----}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:3051 [inline]
             check_prevs_add kernel/locking/lockdep.c:3174 [inline]
             validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
             __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
             ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
             ktime_get_coarse include/linux/timekeeping.h:120 [inline]
             ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
             ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
             bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
             bpf_prog_a99735ebafdda2f1+0x10/0xb50
             bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
             __bpf_prog_run include/linux/filter.h:626 [inline]
             bpf_prog_run include/linux/filter.h:633 [inline]
             BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
             trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
             perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
             perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
             trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
             lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
             _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
             debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
             debug_deactivate kernel/time/hrtimer.c:481 [inline]
             __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
             __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
             hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
             local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
             __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
             sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
             asm_sysvec_apic_timer_interrupt+0x12/0x20
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
             _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
             try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
             wake_up_process kernel/sched/core.c:4200 [inline]
             wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
             futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
             do_futex+0x367/0x560 kernel/futex/syscalls.c:127
             __do_sys_futex kernel/futex/syscalls.c:199 [inline]
             __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
             do_syscall_x64 arch/x86/entry/common.c:50 [inline]
             do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      There is a possible deadlock with bpf_timer_* set of helpers:
      hrtimer_start()
        lock_base();
        trace_hrtimer...()
          perf_event()
            bpf_run()
              bpf_timer_start()
                hrtimer_start()
                  lock_base()         <- DEADLOCK
      
      Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
      BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
      and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.
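
The restriction described above can be modeled as a lookup that refuses certain helpers for tracing program types. The following is a minimal userspace sketch, not the kernel code: the enum values and the functions `is_tracing_prog()` and `helper_allowed()` are hypothetical names chosen for illustration, and only the list of program types and helpers comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's program types. */
enum prog_type {
	PROG_SOCKET_FILTER,
	PROG_KPROBE,
	PROG_TRACEPOINT,
	PROG_PERF_EVENT,
	PROG_RAW_TRACEPOINT,
};

/* Hypothetical stand-ins for a few helper IDs. */
enum helper_id {
	HELPER_KTIME_GET_NS,
	HELPER_KTIME_GET_COARSE_NS,
	HELPER_TIMER_START,
};

/* The four program types named in the commit message. */
static bool is_tracing_prog(enum prog_type type)
{
	switch (type) {
	case PROG_KPROBE:
	case PROG_TRACEPOINT:
	case PROG_PERF_EVENT:
	case PROG_RAW_TRACEPOINT:
		return true;
	default:
		return false;
	}
}

/* Helpers that may take locks unsafe in tracing context are refused there. */
static bool helper_allowed(enum helper_id helper, enum prog_type type)
{
	switch (helper) {
	case HELPER_KTIME_GET_COARSE_NS:
	case HELPER_TIMER_START:
		return !is_tracing_prog(type);
	default:
		return true;
	}
}
```

In this model a perf-event program can still call the helper standing in for bpf_ktime_get_ns(), while the coarse variant and the timer helper are rejected.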
      
      Fixes: d0551261 ("bpf: Add bpf_ktime_get_coarse_ns helper")
      Fixes: b00628b1 ("bpf: Introduce bpf timers.")
      Reported-by: <syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com>
      Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
      5e0bc308
  2. Nov 13, 2021
    • Kumar Kartikeya Dwivedi's avatar
      libbpf: Perform map fd cleanup for gen_loader in case of error · ba05fd36
      Kumar Kartikeya Dwivedi authored
      Alexei reported a fd leak issue in gen loader (when invoked from
      bpftool) [0]. When adding ksym support, map fd allocation was moved from
      stack to loader map, however I missed closing these fds (relevant when
      cleanup label is jumped to on error). For the success case, the
      allocated fd is returned in loader ctx, hence this problem is not
      noticed.
      
      Make three changes. First, use MAX_USED_MAPS in MAX_FD_ARRAY_SZ instead
      of MAX_USED_PROGS; the braino was not a problem until now because we
      didn't try to close map fds (otherwise it would have led to closing 32
      additional fds in the ksym btf fd range). Second, do a cleanup for all
      nr_maps fds in the cleanup label code, so that in case of error all
      temporary map fds from bpf_gen__map_create are closed.
      
      Then, adjust the cleanup label to only generate code for the required
      number of program and map fds.  To trim code for remaining program
      fds, lay out prog_fd array in stack in the end, so that we can
      directly skip the remaining instances.  Still stack size remains same,
      since changing that would require changes in a lot of places
      (including adjustment of stack_off macro), so nr_progs_sz variable is
      only used to track required number of iterations (and jump over
      cleanup size calculated from that), stack offset calculation remains
      unaffected.
      
      The difference for test_ksyms_module.o is as follows:
      libbpf: //prog cleanup iterations: before = 34, after = 5
      libbpf: //maps cleanup iterations: before = 64, after = 2
      
      Also, move allocation of gen->fd_array offset to bpf_gen__init. Since
      offset can now be 0, and we already continue even if add_data returns 0
      in case of failure, we do not need to distinguish between 0 offset and
      failure case 0, as we rely on bpf_gen__finish to check errors. We can
      also skip check for gen->fd_array in add_*_fd functions, since
      bpf_gen__init will take care of it.
      
        [0]: https://lore.kernel.org/bpf/CAADnVQJ6jSitKSNKyxOrUzwY2qDRX0sPkJ=VLGHuCLVJ=qOt9g@mail.gmail.com
      
      Fixes: 18f4fccb ("libbpf: Update gen_loader to emit BTF_KIND_FUNC relocations")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211112232022.899074-1-memxor@gmail.com
      ba05fd36
    • Kumar Kartikeya Dwivedi's avatar
      samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu · 2453afe3
      Kumar Kartikeya Dwivedi authored
      Commit b599015f ("samples/bpf: Fix application of sizeof to pointer")
      tried to fix a bug where sizeof was incorrectly applied to a pointer, rather
      than to the array the string was being copied to, in order to find the
      destination buffer size. It ended up using strlen instead, which is still
      incorrect. However, on closer look ifname_buf has no other use, so use
      optarg directly.
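
The two mistakes mentioned above are easy to reproduce in isolation. The sketch below is illustrative only and is not taken from xdp_redirect_cpu: `IFNAME_BUF_SZ`, `size_via_pointer()`, `size_via_strlen()` and `copy_name()` are hypothetical names. It shows that neither sizeof on a pointer nor strlen on the destination yields the buffer capacity, and that a bounded copy needs the capacity passed in explicitly.

```c
#include <stddef.h>
#include <string.h>

#define IFNAME_BUF_SZ 16 /* hypothetical capacity, for illustration only */

/* sizeof applied to a pointer yields the pointer size, not the buffer size. */
static size_t size_via_pointer(const char *buf)
{
	return sizeof(buf);
}

/* strlen yields the length of the current contents, not the capacity. */
static size_t size_via_strlen(const char *buf)
{
	return strlen(buf);
}

/*
 * Copy src into dst, which holds cap bytes, always NUL-terminating.
 * Returns 0 on success and -1 if src had to be truncated.
 */
static int copy_name(char *dst, size_t cap, const char *src)
{
	size_t len = strlen(src);

	if (cap == 0)
		return -1;
	if (len >= cap) {
		memcpy(dst, src, cap - 1);
		dst[cap - 1] = '\0';
		return -1;
	}
	memcpy(dst, src, len + 1);
	return 0;
}
```

With a caller-side array `char buf[IFNAME_BUF_SZ]`, `sizeof(buf)` at the call site is the correct capacity to pass as `cap`. The commit avoids the question entirely by dropping the intermediate buffer.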
      
      Fixes: b599015f ("samples/bpf: Fix application of sizeof to pointer")
      Fixes: e531a220 ("samples: bpf: Convert xdp_redirect_cpu to XDP samples helper")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Tested-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/bpf/20211112020301.528357-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2453afe3
    • Jean-Philippe Brucker's avatar
      tools/runqslower: Fix cross-build · e4ac80ef
      Jean-Philippe Brucker authored
      Commit be79505c ("tools/runqslower: Install libbpf headers when
      building") uses the target libbpf to build the host bpftool, which
      doesn't work when cross-building:
      
        make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -C tools/bpf/runqslower O=/tmp/runqslower
        ...
          LINK    /tmp/runqslower/bpftool/bpftool
        /usr/bin/ld: /tmp/runqslower/libbpf/libbpf.a(libbpf-in.o): Relocations in generic ELF (EM: 183)
        /usr/bin/ld: /tmp/runqslower/libbpf/libbpf.a: error adding symbols: file in wrong format
        collect2: error: ld returned 1 exit status
      
      When cross-building, the target architecture differs from the host. The
      bpftool used for building runqslower is executed on the host, and thus
      must use a different libbpf than that used for runqslower itself.
      Remove the LIBBPF_OUTPUT and LIBBPF_DESTDIR parameters, so the bpftool
      build makes its own library if necessary.
      
      In the selftests, pass the host bpftool, already a prerequisite for the
      runqslower recipe, as BPFTOOL_OUTPUT. The runqslower Makefile will use
      the bpftool that's already built for selftests instead of making a new
      one.
      
      Fixes: be79505c ("tools/runqslower: Install libbpf headers when building")
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20211112155128.565680-1-jean-philippe@linaro.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e4ac80ef
    • Alexander Lobakin's avatar
      samples/bpf: Fix summary per-sec stats in xdp_sample_user · dc14ca46
      Alexander Lobakin authored
      sample_summary_print() uses accumulated period to calculate and display
      per-sec averages. This period gets incremented by the sampling interval
      each time a new sample is formed, and thus equals the number of samples
      collected multiplied by this interval.
      
      However, the totals are calculated differently: they receive the current
      sample statistics already divided by the interval, obtained as a difference
      between sample timestamps for better precision. In other words, they are
      incremented by the per-sec values each sample.
      
      This leads to the excessive division of summary per-secs when interval != 1
      sec. It is obvious pps couldn't become two times lower just from picking a
      different sampling interval value:
      
        $ samples/bpf/xdp_redirect_cpu -p xdp_prognum_n1_inverse_qnum -c all
          -s -d 6 -i 1
        < snip >
          Packets received    : 2,197,230,321
          Average packets/s   : 22,887,816
          Packets redirected  : 2,197,230,472
          Average redir/s     : 22,887,817
        $ samples/bpf/xdp_redirect_cpu -p xdp_prognum_n1_inverse_qnum -c all
          -s -d 6 -i 2
        < snip >
          Packets received    : 159,566,498
          Average packets/s   : 11,397,607
          Packets redirected  : 159,566,995
          Average redir/s     : 11,397,642
      
      This can be easily fixed by treating the divisor not as a period, but rather
      as a total number of samples, and thus incrementing it by 1 instead of
      interval. As a nice side effect, we can now remove so-named argument from a
      couple of functions. Let us also create an "alias" for sample_output::rx_cnt::pps
      named 'num' using a union since this field is used to store this number (period
      previously) as well, and the resulting counter-intuitive code might've been a
      reason for this bug.
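
The arithmetic behind the bug can be checked with a small model. This is a hedged sketch rather than the xdp_sample_user code: `struct summary`, `add_sample()`, `avg_old()` and `avg_new()` are hypothetical names. The totals accumulate per-second values, so dividing them by the accumulated period divides by the interval a second time, while dividing by the number of samples does not.

```c
/* Hypothetical model of the summary accumulator. */
struct summary {
	double total_pps;    /* sum of per-second values, one per sample */
	unsigned int period; /* old divisor: grows by the interval each sample */
	unsigned int num;    /* new divisor: grows by 1 each sample */
};

/* Record one sample of pkts packets observed over interval_sec seconds. */
static void add_sample(struct summary *s, double pkts,
		       unsigned int interval_sec)
{
	s->total_pps += pkts / interval_sec; /* already a per-second value */
	s->period += interval_sec;
	s->num += 1;
}

/* Old behavior: divides per-second totals by the accumulated period. */
static double avg_old(const struct summary *s)
{
	return s->total_pps / s->period;
}

/* Fixed behavior: divides per-second totals by the number of samples. */
static double avg_new(const struct summary *s)
{
	return s->total_pps / s->num;
}
```

For a steady 1000 packets/s sampled three times at a 2 second interval, the old average comes out at 500 and the new one at 1000, which matches the halving visible in the output quoted in the commit message.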
      
      Fixes: 156f886c ("samples: bpf: Add basic infrastructure for XDP samples")
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/bpf/20211111215703.690-1-alexandr.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      dc14ca46
    • Lorenz Bauer's avatar
      selftests/bpf: Check map in map pruning · 6af2e123
      Lorenz Bauer authored
      
      
      Ensure that two registers with a map_value loaded from a nested
      map are considered equivalent for the purpose of state pruning
      and don't cause the verifier to revisit a pruning point.
      
      This uses a rather crude match on the number of insns visited by
      the verifier, which might change in the future. I've therefore
      tried to keep the code as "unpruneable" as possible by having
      the code paths only converge on the second to last instruction.
      
      Should you require to adjust the test in the future, reducing the
      number of processed instructions should always be safe. Increasing
      them could cause another regression, so proceed with caution.
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/CACAyw99hVEJFoiBH_ZGyy=+oO-jyydoz6v1DeKPKs2HVsUH28w@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211111161452.86864-1-lmb@cloudflare.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6af2e123
  3. Nov 12, 2021
    • Alexei Starovoitov's avatar
      bpf: Fix inner map state pruning regression. · 34d11a44
      Alexei Starovoitov authored
      The introduction of map_uid made two lookups from the same outer map distinct.
      That distinction is only necessary when the inner map has an embedded timer.
      Otherwise it makes verifier state pruning overly conservative, which causes
      complex programs to hit the 1M insn_processed limit.
      Tighten the map_uid logic to apply only to inner maps with timers.
      
      Fixes: 3e8ce298 ("bpf: Prevent pointer mismatch in bpf_timer_init.")
      Reported-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Lorenz Bauer <lmb@cloudflare.com>
      Link: https://lore.kernel.org/bpf/CACAyw99hVEJFoiBH_ZGyy=+oO-jyydoz6v1DeKPKs2HVsUH28w@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211110172556.20754-1-alexei.starovoitov@gmail.com
      34d11a44
    • Magnus Karlsson's avatar
      xsk: Fix crash on double free in buffer pool · 199d983b
      Magnus Karlsson authored
      Fix a crash in the buffer pool allocator when a buffer is double
      freed. It is possible to trigger this behavior not only from a faulty
      driver, but also from user space like this: Create a zero-copy AF_XDP
      socket. Load an XDP program that will issue XDP_DROP for all
      packets. Put the same umem buffer into the fill ring multiple times,
      then bind the socket and send some traffic. This will crash the kernel
      as the XDP_DROP action triggers one call to xsk_buff_free()/xp_free()
      for every packet dropped. Each call will add the corresponding buffer
      entry to the free_list and increase the free_list_cnt. Some entries
      will have been added multiple times due to the same buffer being
      freed. The buffer allocation code will then traverse this broken list
      and since the same buffer is in the list multiple times, it will try
      to delete the same buffer twice from the list leading to a crash.
      
      The fix for this is just to test that the buffer has not been added
      before in xp_free(). If it has been, just return from the function and
      do not put it in the free_list a second time.
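
The guard described above can be sketched in userspace with a small circular list. This is an illustrative model under stated assumptions, not the xsk buffer pool code: `struct node`, `struct pool`, `struct buf` and `pool_free()` are hypothetical names. A node that points to itself is treated as not being on any list, so a second free of the same buffer is detected and ignored.

```c
/* Minimal circular doubly linked list node (hypothetical, for illustration). */
struct node {
	struct node *prev;
	struct node *next;
};

/* A node that points to itself is not linked into any list. */
static void node_init(struct node *n)
{
	n->prev = n;
	n->next = n;
}

static int node_unlinked(const struct node *n)
{
	return n->next == n;
}

/* Insert n right after head. */
static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

struct pool {
	struct node free_list;
	unsigned int free_list_cnt;
};

struct buf {
	struct node free_node;
};

static void pool_init(struct pool *p)
{
	node_init(&p->free_list);
	p->free_list_cnt = 0;
}

static void buf_init(struct buf *b)
{
	node_init(&b->free_node);
}

/* Put b on the free list unless it is already there. */
static void pool_free(struct pool *p, struct buf *b)
{
	if (!node_unlinked(&b->free_node))
		return; /* double free: already on the free list */

	node_add(&b->free_node, &p->free_list);
	p->free_list_cnt++;
}
```

Without the early return, the second call would link the same node again and bump the counter, leaving a list whose entries no longer match its count.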
      
      Note that this bug was not present in the code before the commit
      referenced in the Fixes tag. That code used one list entry per
      allocated buffer, so multiple frees did not have any side effects. But
      the commit below optimized the usage of the pool and only uses a
      single entry per buffer in the umem, meaning that multiple
      allocations/frees of the same buffer will also only use one entry,
      thus leading to the problem.
      
      Fixes: 47e4075d ("xsk: Batched buffer allocation for the pool")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Link: https://lore.kernel.org/bpf/20211111075707.21922-1-magnus.karlsson@gmail.com
      199d983b
    • Linus Torvalds's avatar
      Merge tag 'pci-v5.16-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 5833291a
      Linus Torvalds authored
      Pull PCI fixes from Bjorn Helgaas:
       "Revert conversion to struct device.driver instead of struct
        pci_dev.driver.
      
        The device.driver is set earlier, and using it caused the PCI core to
        call driver PM entry points before .probe() and after .remove(), when
        the driver isn't prepared.
      
        This caused NULL pointer dereferences in i2c_designware_pci and
        probably other driver issues"
      
      * tag 'pci-v5.16-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        Revert "PCI: Use to_pci_driver() instead of pci_dev->driver"
        Revert "PCI: Remove struct pci_dev->driver"
      5833291a
    • Linus Torvalds's avatar
      Merge tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu · ca2ef2d9
      Linus Torvalds authored
      Pull KCSAN updates from Paul McKenney:
       "This contains initialization fixups, testing improvements, addition of
        instruction pointer to data-race reports, and scoped data-race checks"
      
      * tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
        kcsan: selftest: Cleanup and add missing __init
        kcsan: Move ctx to start of argument list
        kcsan: Support reporting scoped read-write access type
        kcsan: Start stack trace with explicit location if provided
        kcsan: Save instruction pointer for scoped accesses
        kcsan: Add ability to pass instruction pointer of access to reporting
        kcsan: test: Fix flaky test case
        kcsan: test: Use kunit_skip() to skip tests
        kcsan: test: Defer kcsan_test_init() after kunit initialization
      ca2ef2d9
    • Linus Torvalds's avatar
      Merge tag 'apparmor-pr-2021-11-10' of... · 5593a733
      Linus Torvalds authored
      Merge tag 'apparmor-pr-2021-11-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
      
      Pull apparmor updates from John Johansen:
       "Features
         - use per file locks for transactional queries
         - update policy management capability checks to work with LSM stacking
      
        Bug Fixes:
         - check/put label on apparmor_sk_clone_security()
         - fix error check on update of label hname
         - fix introspection of of task mode for unconfined tasks
      
        Cleanups:
         - avoid -Wempty-body warning
         - remove duplicated 'Returns:' comments
         - fix doc warning
         - remove unneeded one-line hook wrappers
         - use struct_size() helper in kzalloc()
         - fix zero-length compiler warning in AA_BUG()
         - file.h: delete duplicated word
         - delete repeated words in comments
         - remove repeated declaration"
      
      * tag 'apparmor-pr-2021-11-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
        apparmor: remove duplicated 'Returns:' comments
        apparmor: remove unneeded one-line hook wrappers
        apparmor: Use struct_size() helper in kzalloc()
        apparmor: fix zero-length compiler warning in AA_BUG()
        apparmor: use per file locks for transactional queries
        apparmor: fix doc warning
        apparmor: Remove the repeated declaration
        apparmor: avoid -Wempty-body warning
        apparmor: Fix internal policy capable check for policy management
        apparmor: fix error check
        security: apparmor: delete repeated words in comments
        security: apparmor: file.h: delete duplicated word
        apparmor: switch to apparmor to internal capable check for policy management
        apparmor: update policy capable checks to use a label
        apparmor: fix introspection of of task mode for unconfined tasks
        apparmor: check/put label on apparmor_sk_clone_security()
      5593a733
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · dbf49896
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
       "The post-linux-next material.
      
        7 patches.
      
        Subsystems affected by this patch series (all mm): debug,
        slab-generic, migration, memcg, and kasan"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        kasan: add kasan mode messages when kasan init
        mm: unexport {,un}lock_page_memcg
        mm: unexport folio_memcg_{,un}lock
        mm/migrate.c: remove MIGRATE_PFN_LOCKED
        mm: migrate: simplify the file-backed pages validation when migrating its mapping
        mm: allow only SLUB on PREEMPT_RT
        mm/page_owner.c: modify the type of argument "order" in some functions
      dbf49896
    • Linus Torvalds's avatar
      Merge tag 'm68knommu-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu · 6d76f6eb
      Linus Torvalds authored
      Pull m68knommu updates from Greg Ungerer:
       "Only two changes.
      
        One removes the now unused CONFIG_MCPU32 symbol. The other sets a
        default for the CONFIG_MEMORY_RESERVE config symbol (this aids
        scripting and other automation) so you don't interactively get asked
        for a value at configure time.
      
        Summary:
      
         - remove unused CONFIG_MCPU32 symbol
      
         - default CONFIG_MEMORY_RESERVE value (for scripting)"
      
      * tag 'm68knommu-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
        m68knommu: Remove MCPU32 config symbol
        m68k: set a default value for MEMORY_RESERVE
      6d76f6eb
    • Bjorn Helgaas's avatar
      Revert "PCI: Use to_pci_driver() instead of pci_dev->driver" · e0217c5b
      Bjorn Helgaas authored
      This reverts commit 2a4d9408.
      
      Robert reported a NULL pointer dereference caused by the PCI core
      (local_pci_probe()) calling the i2c_designware_pci driver's
      .runtime_resume() method before the .probe() method.  i2c_dw_pci_resume()
      depends on initialization done by i2c_dw_pci_probe().
      
      Prior to 2a4d9408 ("PCI: Use to_pci_driver() instead of
      pci_dev->driver"), pci_pm_runtime_resume() avoided calling the
      .runtime_resume() method because pci_dev->driver had not been set yet.
      
      2a4d9408 and b5f9c644 ("PCI: Remove struct pci_dev->driver"),
      removed pci_dev->driver, replacing it by device->driver, which *has* been
      set by this time, so pci_pm_runtime_resume() called the .runtime_resume()
      method when it previously had not.
      
      Fixes: 2a4d9408 ("PCI: Use to_pci_driver() instead of pci_dev->driver")
      Link: https://lore.kernel.org/linux-i2c/CAP145pgdrdiMAT7=-iB1DMgA7t_bMqTcJL4N0=6u8kNY3EU0dw@mail.gmail.com/
      Reported-by: Robert Święcki <robert@swiecki.net>
      Tested-by: Robert Święcki <robert@swiecki.net>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      e0217c5b
    • Bjorn Helgaas's avatar
      Revert "PCI: Remove struct pci_dev->driver" · 68da4e0e
      Bjorn Helgaas authored
      This reverts commit b5f9c644.
      
      Revert b5f9c644 ("PCI: Remove struct pci_dev->driver"), which is needed
      to revert 2a4d9408 ("PCI: Use to_pci_driver() instead of
      pci_dev->driver").
      
      2a4d9408 caused a NULL pointer dereference reported by Robert Święcki.
      Details in the revert of that commit.
      
      Fixes: 2a4d9408 ("PCI: Use to_pci_driver() instead of pci_dev->driver")
      Link: https://lore.kernel.org/linux-i2c/CAP145pgdrdiMAT7=-iB1DMgA7t_bMqTcJL4N0=6u8kNY3EU0dw@mail.gmail.com/
      Reported-by: Robert Święcki <robert@swiecki.net>
      Tested-by: Robert Święcki <robert@swiecki.net>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      68da4e0e
    • Linus Torvalds's avatar
      Merge tag 'trace-v5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 600b18f8
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
       "Two locking fixes:
      
         - Add mutex protection to ring_buffer_reset()
      
         - Fix deadlock in modify_ftrace_direct_multi()"
      
      * tag 'trace-v5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace/direct: Fix lockup in modify_ftrace_direct_multi
        ring-buffer: Protect ring_buffer_reset() from reentrancy
      600b18f8
    • Linus Torvalds's avatar
      Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f54ca91f
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, can and netfilter.
      
        Current release - regressions:
      
         - bpf: do not reject when the stack read size is different from the
           tracked scalar size
      
         - net: fix premature exit from NAPI state polling in napi_disable()
      
         - riscv, bpf: fix RV32 broken build, and silence RV64 warning
      
        Current release - new code bugs:
      
         - net: fix possible NULL deref in sock_reserve_memory
      
         - amt: fix error return code in amt_init(); fix stopping the
           workqueue
      
         - ax88796c: use the correct ioctl callback
      
        Previous releases - always broken:
      
         - bpf: stop caching subprog index in the bpf_pseudo_func insn
      
         - security: fixups for the security hooks in sctp
      
         - nfc: add necessary privilege flags in netlink layer, limit
           operations to admin only
      
         - vsock: prevent unnecessary refcnt inc for non-blocking connect
      
         - net/smc: fix sk_refcnt underflow on link down and fallback
      
         - nfnetlink_queue: fix OOB when mac header was cleared
      
         - can: j1939: ignore invalid messages per standard
      
         - bpf, sockmap:
            - fix race in ingress receive verdict with redirect to self
            - fix incorrect sk_skb data_end access when src_reg = dst_reg
            - strparser, and tls are reusing qdisc_skb_cb and colliding
      
         - ethtool: fix ethtool msg len calculation for pause stats
      
         - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
           access an unregistering real_dev
      
         - udp6: make encap_rcv() bump the v6 not v4 stats
      
         - drv: prestera: add explicit padding to fix m68k build
      
         - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
      
         - drv: mvpp2: fix wrong SerDes reconfiguration order
      
        Misc & small latecomers:
      
         - ipvs: auto-load ipvs on genl access
      
         - mctp: sanity check the struct sockaddr_mctp padding fields
      
         - libfs: support RENAME_EXCHANGE in simple_rename()
      
         - avoid double accounting for pure zerocopy skbs"
      
      * tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
        selftests/net: udpgso_bench_rx: fix port argument
        net: wwan: iosm: fix compilation warning
        cxgb4: fix eeprom len when diagnostics not implemented
        net: fix premature exit from NAPI state polling in napi_disable()
        net/smc: fix sk_refcnt underflow on linkdown and fallback
        net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
        gve: fix unmatched u64_stats_update_end()
        net: ethernet: lantiq_etop: Fix compilation error
        selftests: forwarding: Fix packet matching in mirroring selftests
        vsock: prevent unnecessary refcnt inc for nonblocking connect
        net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
        net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
        net: stmmac: allow a tc-taprio base-time of zero
        selftests: net: test_vxlan_under_vrf: fix HV connectivity test
        net: hns3: allow configure ETS bandwidth of all TCs
        net: hns3: remove check VF uc mac exist when set by PF
        net: hns3: fix some mac statistics is always 0 in device version V2
        net: hns3: fix kernel crash when unload VF while it is being reset
        net: hns3: sync rx ring head in echo common pull
        net: hns3: fix pfc packet number incorrect after querying pfc parameters
        ...
      f54ca91f
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · c55a0417
      Linus Torvalds authored
      Pull char/misc fix from Greg KH:
       "Here is a single fix for 5.16-rc1 to resolve a build problem that came
        in through the coresight tree (and as such came in through the
        char/misc tree merge in the 5.16-rc1 merge window).
      
        It resolves a build problem with 'allmodconfig' on arm64 and is acked
        by the proper subsystem maintainers. It has been in linux-next all
        week with no reported problems"
      
      * tag 'char-misc-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        arm64: cpufeature: Export this_cpu_has_cap helper
      c55a0417
    • Linus Torvalds's avatar
      Merge tag 'usb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 5625207d
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small reverts and fixes for USB drivers for issues that
        came up during the 5.16-rc1 merge window.
      
        These include:
      
         - two reverts of xhci and USB core patches that are causing problems
           in many systems.
      
         - xhci 3.1 enumeration delay fix for systems that were having
           problems.
      
        All three of these have been in linux-next all week with no reported
        issues"
      
      * tag 'usb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        xhci: Fix USB 3.1 enumeration issues by increasing roothub power-on-good delay
        Revert "usb: core: hcd: Add support for deferring roothub registration"
        Revert "xhci: Set HCD flag to defer primary roothub registration"
      5625207d
    • Kuan-Ying Lee's avatar
      kasan: add kasan mode messages when kasan init · b873e986
      Kuan-Ying Lee authored
      
      
      There are multiple kasan modes.  It makes sense that we add some
      messages to know which kasan mode is active when booting up [1].
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=212195 [1]
      Link: https://lkml.kernel.org/r/20211020094850.4113-1-Kuan-Ying.Lee@mediatek.com
      Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Yee Lee <yee.lee@mediatek.com>
      Cc: Nicholas Tang <nicholas.tang@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b873e986
    • Christoph Hellwig's avatar
      mm: unexport {,un}lock_page_memcg · ab2f9d2d
      Christoph Hellwig authored
      
      
      These are only used in built-in core mm code.
      
      Link: https://lkml.kernel.org/r/20210820095815.445392-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab2f9d2d
    • Christoph Hellwig's avatar
      mm: unexport folio_memcg_{,un}lock · 913ffbdd
      Christoph Hellwig authored
      
      
      Patch series "unexport memcg locking helpers".
      
      Neither the old page-based nor the new folio-based memcg locking helpers
      are used in modular code at all, so drop the exports.
      
      This patch (of 2):
      
      folio_memcg_{,un}lock are only used in built-in core mm code.
      
      Link: https://lkml.kernel.org/r/20210820095815.445392-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20210820095815.445392-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      913ffbdd
    • Alistair Popple's avatar
      mm/migrate.c: remove MIGRATE_PFN_LOCKED · ab09243a
      Alistair Popple authored
      
      
      MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
      source page was already locked during migrate_vma_collect().  If it
      wasn't then a second attempt is made to lock the page.  However if
      the first attempt failed it's unlikely a second attempt will succeed,
      and the retry adds complexity.  So clean this up by removing the retry
      and MIGRATE_PFN_LOCKED flag.
      
      Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
      set, but nothing actually checks that.
      
      Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab09243a
    • Baolin Wang's avatar
      mm: migrate: simplify the file-backed pages validation when migrating its mapping · 0ef02462
      Baolin Wang authored
      
      
      There is no need to validate the file-backed page's refcount before
      trying to freeze the page's expected refcount, instead we can rely on
      the folio_ref_freeze() to validate if the page has the expected refcount
      before migrating its mapping.
      
      Moreover we are always under the page lock when migrating the page
      mapping, which means nowhere else can remove it from the page cache, so
      we can remove the xas_load() validation under the i_pages lock.
      
      Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ef02462
    • Ingo Molnar's avatar
      mm: allow only SLUB on PREEMPT_RT · 252220da
      Ingo Molnar authored
      
      
      Memory allocators may disable interrupts or preemption as part of the
      allocation and freeing process.  For PREEMPT_RT it is important that
      these sections remain deterministic and short and therefore don't depend
      on the size of the memory to allocate/free or the inner state of the
      algorithm.
      
      Until v3.12-RT the SLAB allocator was an option but involved several
      changes to meet all the requirements.  The SLUB design fits better with
      PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT
      patchset.  Comparing the two allocators, SLUB outperformed SLAB in both
      throughput (time needed to allocate and free memory) and the maximal
      latency of the system measured with cyclictest during hackbench.
      
      SLOB was never evaluated since it was unlikely that it performs better
      than SLAB.  During a quick test, the kernel crashed with SLOB enabled
      during boot.
      
      Disable SLAB and SLOB on PREEMPT_RT.
      
      [bigeasy@linutronix.de: commit description]
      
      Link: https://lkml.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      252220da
    • Yixuan Cao's avatar
      mm/page_owner.c: modify the type of argument "order" in some functions · 0093de69
      Yixuan Cao authored
      
      
      The type of "order" in struct page_owner is unsigned short.
      However, it is unsigned int in the following 3 functions:
      
        __reset_page_owner
        __set_page_owner_handle
        __set_page_owner_handle
      
      The type of "order" in argument list is unsigned int, which is
      inconsistent.
      
      [akpm@linux-foundation.org: update include/linux/page_owner.h]
      
      Link: https://lkml.kernel.org/r/20211020125945.47792-1-caoyixuan2019@email.szu.edu.cn
      Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0093de69
  4. Nov 11, 2021
    • Willem de Bruijn's avatar
      selftests/net: udpgso_bench_rx: fix port argument · d336509c
      Willem de Bruijn authored
      The below commit added optional support for passing a bind address.
      It configures the sockaddr bind arguments before parsing options and
      reconfigures on options -b and -4.
      
      This broke support for passing port (-p) on its own.
      
      Configure sockaddr after parsing all arguments.
      
      Fixes: 3327a9c4 ("selftests: add functionals test for UDP GRO")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d336509c
    • M Chetan Kumar's avatar
      net: wwan: iosm: fix compilation warning · 29cd3867
      M Chetan Kumar authored
      curr_phase is unused. Removed the dead code.
      
      Fixes: 8d9be063 ("net: wwan: iosm: transport layer support for fw flashing/cd")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
      Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29cd3867
    • Rahul Lakkireddy's avatar
      cxgb4: fix eeprom len when diagnostics not implemented · 4ca110bf
      Rahul Lakkireddy authored
      Ensure diagnostics monitoring support is implemented for the SFF 8472
      compliant port module and set the correct length for ethtool port
      module eeprom read.
      
      Fixes: f56ec676 ("cxgb4: Add support for ethtool i2c dump")
      Signed-off-by: Manoj Malviya <manojmalviya@chelsio.com>
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ca110bf
    • Alexander Lobakin's avatar
      net: fix premature exit from NAPI state polling in napi_disable() · 0315a075
      Alexander Lobakin authored
      Commit 719c5719 ("net: make napi_disable() symmetric with
      enable") accidentally introduced a bug sometimes leading to a kernel
      BUG when bringing an iface up/down under heavy traffic load.
      
      Prior to this commit, napi_disable() was polling n->state until
      none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
      always flip them. Now there's a possibility to get away with the
      NAPIF_STATE_SCHED unset as 'continue' drops us to the cmpxchg()
      call with an uninitialized variable, rather than straight to
      another round of the state check.
      
      Error path looks like:
      
      napi_disable():
      unsigned long val, new; /* new is uninitialized */
      
      do {
      	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
      				      NAPIF_STATE_SCHED is set */
      	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
      		usleep_range(20, 200);
      		continue; /* go straight to the condition check */
      	}
      	new = val | <...>
      } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
      						  writes garbage */
      
      napi_enable():
      do {
      	val = READ_ONCE(n->state);
      	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
      <...>
      
      while the typical BUG splat is like:
      
      [  172.652461] ------------[ cut here ]------------
      [  172.652462] kernel BUG at net/core/dev.c:6937!
      [  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
      [  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
      [  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
      [  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
      [  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
      [  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
      < snip >
      [  172.782403] Call Trace:
      [  172.784857]  <TASK>
      [  172.786963]  ice_up_complete+0x6f/0x210 [ice]
      [  172.791349]  ice_xdp+0x136/0x320 [ice]
      [  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
      [  172.799648]  dev_xdp_install+0x61/0xe0
      [  172.803401]  dev_xdp_attach+0x1e0/0x550
      [  172.807240]  dev_change_xdp_fd+0x1e6/0x220
      [  172.811338]  do_setlink+0xee8/0x1010
      [  172.814917]  rtnl_setlink+0xe5/0x170
      [  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
      [  172.823732]  ? security_capable+0x36/0x50
      < snip >
      
      Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
      for-loop with an explicit break.
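
The loop shape the fix describes can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: `STATE_SCHED`, `STATE_NPSVC` and `disable_state()` are hypothetical stand-ins, `atomic_compare_exchange_strong()` takes the place of the kernel's cmpxchg(), and the bounded retry count replaces the kernel's sleep-and-retry so the example terminates.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NAPIF_STATE_SCHED / NAPIF_STATE_NPSVC. */
#define STATE_SCHED (1UL << 0)
#define STATE_NPSVC (1UL << 1)

/*
 * Set both busy bits once neither is set. Returns false if the state
 * stayed busy for max_spins consecutive checks (an illustration-only
 * escape hatch; the kernel code sleeps and keeps polling instead).
 */
static bool disable_state(_Atomic unsigned long *state,
			  unsigned int max_spins)
{
	unsigned long val, new;
	unsigned int spins = 0;

	for (;;) {
		val = atomic_load(state);
		if (val & (STATE_SCHED | STATE_NPSVC)) {
			if (++spins >= max_spins)
				return false;
			/*
			 * In a for (;;) loop, 'continue' goes back to the
			 * load above, so the exchange below is never
			 * reached with an uninitialized 'new'.
			 */
			continue;
		}

		new = val | STATE_SCHED | STATE_NPSVC;
		if (atomic_compare_exchange_strong(state, &val, new))
			break;	/* explicit exit on success */
	}
	return true;
}
```

With a do/while loop, the same 'continue' would jump to the loop condition and run the exchange with whatever 'new' happened to contain, which is the bug described in the message.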
      
      From v1 [0]:
       - just use a for-loop to simplify both the fix and the existing
         code (Eric).
      
      [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
      
      Fixes: 719c5719 ("net: make napi_disable() symmetric with enable")
      Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0315a075
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · debe436e
      Linus Torvalds authored
      Pull ext4 updates from Ted Ts'o:
       "Only bug fixes and cleanups for ext4 this merge window.
      
        Of note are fixes for the combination of the inline_data and
        fast_commit features, and more accurately calculating when to schedule
        additional lazy inode table init, especially when CONFIG_HZ is 100HZ"
      
      * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix error code saved on super block during file system abort
        ext4: inline data inode fast commit replay fixes
        ext4: commit inline data during fast commit
        ext4: scope ret locally in ext4_try_to_trim_range()
        ext4: remove an unused variable warning with CONFIG_QUOTA=n
        ext4: fix boolreturn.cocci warnings in fs/ext4/name.c
        ext4: prevent getting empty inode buffer
        ext4: move ext4_fill_raw_inode() related functions
        ext4: factor out ext4_fill_raw_inode()
        ext4: prevent partial update of the extent blocks
        ext4: check for inconsistent extents between index and leaf block
        ext4: check for out-of-order index extents in ext4_valid_extent_entries()
        ext4: convert from atomic_t to refcount_t on ext4_io_end->count
        ext4: refresh the ext4_ext_path struct after dropping i_data_sem.
        ext4: ensure enough credits in ext4_ext_shift_path_extents
        ext4: correct the left/middle/right debug message for binsearch
        ext4: fix lazy initialization next schedule time computation in more granular unit
        Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
      debe436e
    • Linus Torvalds's avatar
      Merge tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 6070dcc8
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "Fix for a deadlock when direct/buffered IO is done on a mmaped file
        and a fault happens (details in the patch). There's a fstest
        generic/647 that triggers the problem and makes testing hard"
      
      * tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix deadlock due to page faults during direct IO reads and writes
      6070dcc8
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux · 38764c73
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
        support for a filehandle format deprecated 20 years ago, and further
        xdr-related cleanup from Chuck"
      
      * tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
        nfsd4: remove obselete comment
        nfsd: document server-to-server-copy parameters
        NFSD:fix boolreturn.cocci warning
        nfsd: update create verifier comment
        SUNRPC: Change return value type of .pc_encode
        SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
        NFSD: Save location of NFSv4 COMPOUND status
        SUNRPC: Change return value type of .pc_decode
        SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
        SUNRPC: De-duplicate .pc_release() call sites
        SUNRPC: Simplify the SVC dispatch code path
        SUNRPC: Capture value of xdr_buf::page_base
        SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
        svcrdma: Split svcrmda_wc_{read,write} tracepoints
        svcrdma: Split the svcrdma_wc_send() tracepoint
        svcrdma: Split the svcrdma_wc_receive() tracepoint
        NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
        SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
        NFSD: Initialize pointer ni with NULL and not plain integer 0
        NFSD: simplify struct nfsfh
        ...
      38764c73
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 2ec20f48
      Linus Torvalds authored
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
        Features:
         - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
         - Optimisations for READDIR and the 'ls -l' style workload
         - Further replacements of dprintk() with tracepoints and other
           tracing improvements
         - Ensure we re-probe NFSv4 server capabilities when the user does a
           "mount -o remount"
      
        Bugfixes:
         - Fix an Oops in pnfs_mark_request_commit()
         - Fix up deadlocks in the commit code
         - Fix regressions in NFSv2/v3 attribute revalidation due to the
           change_attr_type optimisations
         - Fix some dentry verifier races
         - Fix some missing dentry verifier settings
         - Fix a performance regression in nfs_set_open_stateid_locked()
         - SUNRPC was sending multiple SYN calls when re-establishing a TCP
           connection.
         - Fix multiple NFSv4 issues due to missing sanity checking of server
           return values
         - Fix a potential Oops when FREE_STATEID races with an unmount
      
        Cleanups:
         - Clean up the labelled NFS code
         - Remove unused header <linux/pnfs_osd_xdr.h>"
      
      * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
        NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
        NFS: Remove the nfs4_label argument from decode_getattr_*() functions
        NFS: Remove the nfs4_label argument from nfs_setsecurity
        NFS: Remove the nfs4_label argument from nfs_fhget()
        NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
        NFS: Remove the nfs4_label argument from nfs_instantiate()
        NFS: Remove the nfs4_label from the nfs_setattrres
        NFS: Remove the nfs4_label from the nfs4_getattr_res
        NFS: Remove the f_label from the nfs4_opendata and nfs_openres
        NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
        NFS: Remove the label from the nfs4_lookup_res struct
        NFS: Remove the nfs4_label from the nfs4_link_res struct
        NFS: Remove the nfs4_label from the nfs4_create_res struct
        NFS: Remove the nfs4_label from the nfs_entry struct
        NFS: Create a new nfs_alloc_fattr_with_label() function
        NFS: Always initialise fattr->label in nfs_fattr_alloc()
        NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
        NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
        NFSv4: Remove unnecessary 'minor version' check
        NFSv4: Fix potential Oops in decode_op_map()
        ...
      2ec20f48
    • Linus Torvalds's avatar
      Merge branch 'exit-cleanups-for-v5.16' of... · 5147da90
      Linus Torvalds authored
      Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
      
      Pull exit cleanups from Eric Biederman:
       "While looking at some issues related to the exit path in the kernel I
        found several instances where the code is not using the existing
        abstractions properly.
      
        This set of changes introduces force_fatal_sig, a way of sending a
        signal and not allowing it to be caught, and corrects the misuse of
        the existing abstractions that I found.
      
        A lot of the misuse of the existing abstractions are silly things such
        as doing something after calling a no return function, rolling BUG by
        hand, doing more work than necessary to terminate a kernel thread, or
        calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
      
        In the review a deficiency in force_fatal_sig and force_sig_seccomp
        where ptrace or sigaction could prevent the delivery of the signal was
        found. I have added a change that introduces SA_IMMUTABLE, which
        makes it impossible to interrupt the delivery of those signals and
        allows backporting to fix force_sig_seccomp.
      
        And Arnd found an issue where a function passed to kthread_run had the
        wrong prototype, and after my cleanup was failing to build."
      
      * 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
        soc: ti: fix wkup_m3_rproc_boot_thread return type
        signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
        signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
        exit/r8188eu: Replace the macro thread_exit with a simple return 0
        exit/rtl8712: Replace the macro thread_exit with a simple return 0
        exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
        signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
        signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
        signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
        exit/syscall_user_dispatch: Send ordinary signals on failure
        signal: Implement force_fatal_sig
        exit/kthread: Have kernel threads return instead of calling do_exit
        signal/s390: Use force_sigsegv in default_trap_handler
        signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
        signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
        signal/sparc: In setup_tsb_params convert open coded BUG into BUG
        signal/powerpc: On swapcontext failure force SIGSEGV
        signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
        signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
        signal/sparc32: Remove unreachable do_exit in do_sparc_fault
        ...
      5147da90
    • Linus Torvalds's avatar
      Merge tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · a41b7445
      Linus Torvalds authored
      Pull prctl updates from Christian Brauner:
       "This contains the missing prctl uapi pieces for PR_SCHED_CORE.
      
        In order to activate core scheduling the caller is expected to specify
        the scope of the new core scheduling domain.
      
        For example, passing 2 in the 4th argument of
      
           prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>,  2, 0);
      
        would indicate that the new core scheduling domain encompasses all
        tasks in the process group of <pid>. Specifying 0 would only create a
        core scheduling domain for the thread identified by <pid> and 1 would
        encompass the whole thread-group of <pid>.
      
        Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID,
        and PIDTYPE_PGID. A first version tried to expose those values
        directly to which I objected because:
      
         - PIDTYPE_* is an enum that is kernel internal which we should not
           expose to userspace directly.
      
         - PIDTYPE_* indicates what a given struct pid is used for; it doesn't
           express a scope.
      
        But what the 4th argument of PR_SCHED_CORE prctl() expresses is the
        scope of the operation, i.e. the scope of the core scheduling domain
        at creation time. So Eugene's patch now simply introduces three new
        defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP,
        and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what
        happens.
      
        This has been on the mailing list for quite a while with all relevant
        scheduler folks Cced. I announced multiple times that I'd pick this up
        if I don't see or hear anyone else doing it. None of this touches
        proper scheduler code but only concerns uapi so I think this is fine.
      
        With core scheduling being quite common now for vm managers (e.g.
        moving individual vcpu threads into their own core scheduling domain)
        and container managers (e.g. moving the init process into its own core
        scheduling domain and letting all created children inherit it) having
        to rely on raw numbers passed as the 4th argument in prctl() is a bit
        annoying and everyone is starting to come up with their own defines"
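
As a rough illustration of the scope argument discussed in the message above, the sketch below calls prctl() with the named scopes instead of raw numbers. The fallback #defines carry the values from uapi/linux/prctl.h for builds against older headers; `sched_core_create()` and `scope_name()` are hypothetical helper names invented for this example.

```c
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older headers; values as in uapi/linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE			62
#define PR_SCHED_CORE_CREATE		1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD		0
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
#define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
#endif

/*
 * Hypothetical wrapper: create a new core scheduling domain for pid
 * (0 means the calling task) with the given scope. Returns 0 on
 * success and -1 with errno set otherwise, for example when the
 * kernel was built without core scheduling support.
 */
static int sched_core_create(pid_t pid, unsigned long scope)
{
	return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
		     (unsigned long)pid, scope, 0UL);
}

/* Hypothetical helper: human-readable name for a scope value. */
static const char *scope_name(unsigned long scope)
{
	switch (scope) {
	case PR_SCHED_CORE_SCOPE_THREAD:
		return "thread";
	case PR_SCHED_CORE_SCOPE_THREAD_GROUP:
		return "thread group";
	case PR_SCHED_CORE_SCOPE_PROCESS_GROUP:
		return "process group";
	default:
		return "unknown";
	}
}
```

A caller that wants every thread of the current process in one domain would use sched_core_create(0, PR_SCHED_CORE_SCOPE_THREAD_GROUP) and check the return value, since the call fails on kernels or machines where core scheduling is unavailable.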
      
      * tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
      a41b7445
    • Linus Torvalds's avatar
      Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 6752de1a
      Linus Torvalds authored
      Pull pidfd updates from Christian Brauner:
       "Various places in the kernel have picked up pidfds.
      
        The two most recent additions have probably been the ability to use
        pidfds in bpf maps and the usage of pidfds in mm-based syscalls such
        as process_mrelease() and process_madvise().
      
        The same pattern to turn a pidfd into a struct task exists in two
        places. One of those places used PIDTYPE_TGID while the other one used
        PIDTYPE_PID even though it is clearly documented in all pidfd-helpers
        that pidfds __currently__ only refer to thread-group leaders (subject
        to change in the future if need be).
      
        This isn't a bug per se but has the potential to be one if we allow
        pidfds to refer to individual threads. If that happens we want to
        audit all codepaths that make use of them to ensure they can deal with
        pidfds referring to individual threads.
      
        This adds a simple helper to turn a pidfd into a struct task making it
        easy to grep for such places. Plus, it gets rid of code-duplication"
      
      * tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        mm: use pidfd_get_task()
        pid: add pidfd_get_task() helper
      6752de1a