Skip to content
  1. Feb 05, 2022
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · f9aaa5b0
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "10 patches.
      
        Subsystems affected by this patch series: ipc, MAINTAINERS, and mm
        (vmscan, debug, pagemap, kmemleak, and selftests)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        kselftest/vm: revert "tools/testing/selftests/vm/userfaultfd.c: use swap() to make code cleaner"
        MAINTAINERS: update rppt's email
        mm/kmemleak: avoid scanning potential huge holes
        ipc/sem: do not sleep with a spin lock held
        mm/pgtable: define pte_index so that preprocessor could recognize it
        mm/page_table_check: check entries at pmd levels
        mm/khugepaged: unify collapse pmd clear, flush and free
        mm/page_table_check: use unsigned long for page counters and cleanup
        mm/debug_vm_pgtable: remove pte entry from the page table
        Revert "mm/page_isolation: unset migratetype directly for non Buddy page"
      f9aaa5b0
    • Linus Torvalds's avatar
      Merge tag 'ceph-for-5.17-rc3' of git://github.com/ceph/ceph-client · cff7f223
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "A patch to make it possible to disable zero copy path in the messenger
        to avoid checksum or authentication tag mismatches and ensuing session
        resets in case the destination buffer isn't guaranteed to be stable"
      
      * tag 'ceph-for-5.17-rc3' of git://github.com/ceph/ceph-client:
        libceph: optionally use bounce buffer on recv path in crc mode
        libceph: make recv path in secure mode work the same as send path
      cff7f223
    • Linus Torvalds's avatar
      Merge tag '9p-for-5.17-rc3' of git://github.com/martinetd/linux · 1eb7de17
      Linus Torvalds authored
      Pull 9p fix from Dominique Martinet:
       "Fix 'cannot walk open fid' rule
      
        The 9p 'walk' operation requires fid arguments to not originate from
        an open or create call and we've missed that for a while as the
        servers regularly running tests with don't enforce the check and no
        active reviewer knew about the rule.
      
        Both reporters confirmed reverting this patch fixes things for them
        and looking at it further wasn't actually required... Will take more
        time for follow up and enforcing the rule more thoroughly later"
      
      * tag '9p-for-5.17-rc3' of git://github.com/martinetd/linux:
        Revert "fs/9p: search open fids first"
      1eb7de17
    • Linus Torvalds's avatar
      Merge tag '5.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 633a8e89
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "SMB3 client fixes including:
      
         - multiple fscache related fixes, reenabling ability to read/write to
           cached files for cifs.ko (that was temporarily disabled for cifs.ko
           a few weeks ago due to the recent fscache changes)
      
         - also includes a new fscache helper function ("query_occupancy")
           used by above
      
         - fix for multiuser mounts and NTLMSSP auth (workstation name) for
           stable
      
         - fix locking ordering problem in multichannel code
      
         - trivial malformed comment fix"
      
      * tag '5.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: fix workstation_name for multiuser mounts
        Invalidate fscache cookie only when inode attributes are changed.
        cifs: Fix the readahead conversion to manage the batch when reading from cache
        cifs: Implement cache I/O by accessing the cache directly
        netfs, cachefiles: Add a method to query presence of data in the cache
        cifs: Transition from ->readpages() to ->readahead()
        cifs: unlock chan_lock before calling cifs_put_tcp_session
        Fix a warning about a malformed kernel doc comment in cifs
      633a8e89
    • Shuah Khan's avatar
      kselftest/vm: revert "tools/testing/selftests/vm/userfaultfd.c: use swap() to make code cleaner" · 07d2505b
      Shuah Khan authored
      With this change, userfaultfd fails to build with undefined reference
      swap() error:
      
        userfaultfd.c: In function `userfaultfd_stress':
        userfaultfd.c:1530:17: warning: implicit declaration of function `swap'; did you mean `swab'? [-Wimplicit-function-declaration]
         1530 |                 swap(area_src, area_dst);
              |                 ^~~~
              |                 swab
        /usr/bin/ld: /tmp/ccDGOAdV.o: in function `userfaultfd_stress':
        userfaultfd.c:(.text+0x549e): undefined reference to `swap'
        /usr/bin/ld: userfaultfd.c:(.text+0x54bc): undefined reference to `swap'
        collect2: error: ld returned 1 exit status
      
      Revert the commit to fix the problem.
      
      Link: https://lkml.kernel.org/r/20220202003340.87195-1-skhan@linuxfoundation.org
      Fixes: 2c769ed7
      
       ("tools/testing/selftests/vm/userfaultfd.c: use swap() to make code cleaner")
      Signed-off-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Minghao Chi <chi.minghao@zte.com.cn>
      
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      07d2505b
    • Mike Rapoport's avatar
      MAINTAINERS: update rppt's email · 6a0fb704
      Mike Rapoport authored
      
      
      Use my @kernel.org address
      
      Link: https://lkml.kernel.org/r/20220203090324.3701774-1-rppt@kernel.org
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a0fb704
    • Lang Yu's avatar
      mm/kmemleak: avoid scanning potential huge holes · c10a0f87
      Lang Yu authored
      
      
      When using devm_request_free_mem_region() and devm_memremap_pages() to
      add ZONE_DEVICE memory, if requested free mem region's end pfn were
      huge(e.g., 0x400000000), the node_end_pfn() will be also huge (see
      move_pfn_range_to_zone()).  Thus it creates a huge hole between
      node_start_pfn() and node_end_pfn().
      
      We found on some AMD APUs, amdkfd requested such a free mem region and
      created a huge hole.  In such a case, following code snippet was just
      doing busy test_bit() looping on the huge hole.
      
        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
      	struct page *page = pfn_to_online_page(pfn);
      		if (!page)
      			continue;
      	...
        }
      
      So we got a soft lockup:
      
        watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
        CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
        RIP: 0010:pfn_to_online_page+0x5/0xd0
        Call Trace:
          ? kmemleak_scan+0x16a/0x440
          kmemleak_write+0x306/0x3a0
          ? common_file_perm+0x72/0x170
          full_proxy_write+0x5c/0x90
          vfs_write+0xb9/0x260
          ksys_write+0x67/0xe0
          __x64_sys_write+0x1a/0x20
          do_syscall_64+0x3b/0xc0
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      I did some tests with the patch.
      
      (1) amdgpu module unloaded
      
      before the patch:
      
        real    0m0.976s
        user    0m0.000s
        sys     0m0.968s
      
      after the patch:
      
        real    0m0.981s
        user    0m0.000s
        sys     0m0.973s
      
      (2) amdgpu module loaded
      
      before the patch:
      
        real    0m35.365s
        user    0m0.000s
        sys     0m35.354s
      
      after the patch:
      
        real    0m1.049s
        user    0m0.000s
        sys     0m1.042s
      
      Link: https://lkml.kernel.org/r/20211108140029.721144-1-lang.yu@amd.com
      Signed-off-by: default avatarLang Yu <lang.yu@amd.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c10a0f87
    • Minghao Chi's avatar
      ipc/sem: do not sleep with a spin lock held · 520ba724
      Minghao Chi authored
      We can't call kvfree() with a spin lock held, so defer it.
      
      Link: https://lkml.kernel.org/r/20211223031207.556189-1-chi.minghao@zte.com.cn
      Fixes: fc37a3b8
      
       ("[PATCH] ipc sem: use kvmalloc for sem_undo allocation")
      Reported-by: default avatarZeal Robot <zealci@zte.com.cn>
      Signed-off-by: default avatarMinghao Chi <chi.minghao@zte.com.cn>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Yang Guang <cgel.zte@gmail.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      520ba724
    • Mike Rapoport's avatar
      mm/pgtable: define pte_index so that preprocessor could recognize it · 314c459a
      Mike Rapoport authored
      Since commit 974b9b2c ("mm: consolidate pte_index() and
      pte_offset_*() definitions") pte_index is a static inline and there is
      no define for it that can be recognized by the preprocessor.  As a
      result, vm_insert_pages() uses slower loop over vm_insert_page() instead
      of insert_pages() that amortizes the cost of spinlock operations when
      inserting multiple pages.
      
      Link: https://lkml.kernel.org/r/20220111145457.20748-1-rppt@kernel.org
      Fixes: 974b9b2c
      
       ("mm: consolidate pte_index() and pte_offset_*() definitions")
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reported-by: default avatarChristian Dietrich <stettberger@dokucode.de>
      Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      314c459a
    • Pasha Tatashin's avatar
      mm/page_table_check: check entries at pmd levels · 80110bbf
      Pasha Tatashin authored
      syzbot detected a case where the page table counters were not properly
      updated.
      
        syzkaller login:  ------------[ cut here ]------------
        kernel BUG at mm/page_table_check.c:162!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4
        RIP: 0010:__page_table_check_zero+0x159/0x1a0
        Call Trace:
         free_pcp_prepare+0x3be/0xaa0
         free_unref_page+0x1c/0x650
         free_compound_page+0xec/0x130
         free_transhuge_page+0x1be/0x260
         __put_compound_page+0x90/0xd0
         release_pages+0x54c/0x1060
         __pagevec_release+0x7c/0x110
         shmem_undo_range+0x85e/0x1250
        ...
      
      The repro involved having a huge page that is split due to uprobe event
      temporarily replacing one of the pages in the huge page.  Later the huge
      page was combined again, but the counters were off, as the PTE level was
      not properly updated.
      
      Make sure that when PMD is cleared and prior to freeing the level the
      PTEs are updated.
      
      Link: https://lkml.kernel.org/r/20220131203249.2832273-5-pasha.tatashin@soleen.com
      Fixes: df4e817b
      
       ("mm: page table check")
      Signed-off-by: default avatarPasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80110bbf
    • Pasha Tatashin's avatar
      mm/khugepaged: unify collapse pmd clear, flush and free · e59a47b8
      Pasha Tatashin authored
      
      
      Unify the code that flushes, clears pmd entry, and frees the PTE table
      level into a new function collapse_and_free_pmd().
      
      This cleanup is useful as in the next patch we will add another call to
      this function to iterate through PTE prior to freeing the level for page
      table check.
      
      Link: https://lkml.kernel.org/r/20220131203249.2832273-4-pasha.tatashin@soleen.com
      Signed-off-by: default avatarPasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e59a47b8
    • Pasha Tatashin's avatar
      mm/page_table_check: use unsigned long for page counters and cleanup · 64d8b9e1
      Pasha Tatashin authored
      
      
      For consistency, use "unsigned long" for all page counters.
      
      Also, reduce code duplication by calling __page_table_check_*_clear()
      from __page_table_check_*_set() functions.
      
      Link: https://lkml.kernel.org/r/20220131203249.2832273-3-pasha.tatashin@soleen.com
      Signed-off-by: default avatarPasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: default avatarWei Xu <weixugc@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64d8b9e1
    • Pasha Tatashin's avatar
      mm/debug_vm_pgtable: remove pte entry from the page table · fb5222aa
      Pasha Tatashin authored
      Patch series "page table check fixes and cleanups", v5.
      
      This patch (of 4):
      
      The pte entry that is used in pte_advanced_tests() is never removed from
      the page table at the end of the test.
      
      The issue is detected by page_table_check, to repro compile kernel with
      the following configs:
      
      CONFIG_DEBUG_VM_PGTABLE=y
      CONFIG_PAGE_TABLE_CHECK=y
      CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
      
      During the boot the following BUG is printed:
      
        debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
        ------------[ cut here ]------------
        kernel BUG at mm/page_table_check.c:162!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.16.0-11413-g2c271fe77d52 #3
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
        ...
      
      The entry should be properly removed from the page table before the page
      is released to the free list.
      
      Link: https://lkml.kernel.org/r/20220131203249.2832273-1-pasha.tatashin@soleen.com
      Link: https://lkml.kernel.org/r/20220131203249.2832273-2-pasha.tatashin@soleen.com
      Fixes: a5c3b9ff
      
       ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers")
      Signed-off-by: default avatarPasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Tested-by: default avatarZi Yan <ziy@nvidia.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fb5222aa
    • Chen Wandun's avatar
      Revert "mm/page_isolation: unset migratetype directly for non Buddy page" · a85468b7
      Chen Wandun authored
      This reverts commit 721fb891.
      
      Commit 721fb891 ("mm/page_isolation: unset migratetype directly for
      non Buddy page") will result memory that should in buddy disappear by
      mistake.  move_freepages_block moves all pages in pageblock instead of
      pages indicated by input parameter, so if input pages is not in buddy
      but other pages in pageblock is in buddy, it will result in page out of
      control.
      
      Link: https://lkml.kernel.org/r/20220126024436.13921-1-chenwandun@huawei.com
      Fixes: 721fb891
      
       ("mm/page_isolation: unset migratetype directly for non Buddy page")
      Signed-off-by: default avatarChen Wandun <chenwandun@huawei.com>
      Reported-by: default avatar"kernelci.org bot" <bot@kernelci.org>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarDong Aisheng <aisheng.dong@nxp.com>
      Tested-by: default avatarFrancesco Dolcini <francesco.dolcini@toradex.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a85468b7
  2. Feb 04, 2022
    • Kees Cook's avatar
      gcc-plugins/stackleak: Use noinstr in favor of notrace · dcb85f85
      Kees Cook authored
      
      
      While the stackleak plugin was already using notrace, objtool is now a
      bit more picky.  Update the notrace uses to noinstr.  Silences the
      following objtool warnings when building with:
      
      CONFIG_DEBUG_ENTRY=y
      CONFIG_STACK_VALIDATION=y
      CONFIG_VMLINUX_VALIDATION=y
      CONFIG_GCC_PLUGIN_STACKLEAK=y
      
        vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section
      
      Note that the plugin's addition of calls to stackleak_track_stack() from
      noinstr functions is expected to be safe, as it isn't runtime
      instrumentation and is self-contained.
      
      Cc: Alexander Popov <alex.popov@linux.com>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcb85f85
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · eb2eb516
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, netfilter, and ieee802154.
      
        Current release - regressions:
      
         - Partially revert "net/smc: Add netlink net namespace support", fix
           uABI breakage
      
         - netfilter:
            - nft_ct: fix use after free when attaching zone template
            - nft_byteorder: track register operations
      
        Previous releases - regressions:
      
         - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
      
         - phy: qca8081: fix speeds lower than 2.5Gb/s
      
         - sched: fix use-after-free in tc_new_tfilter()
      
        Previous releases - always broken:
      
         - tcp: fix mem under-charging with zerocopy sendmsg()
      
         - tcp: add missing tcp_skb_can_collapse() test in
           tcp_shift_skb_data()
      
         - neigh: do not trigger immediate probes on NUD_FAILED from
           neigh_managed_work, avoid a deadlock
      
         - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
           false-positives
      
         - netfilter: nft_reject_bridge: fix for missing reply from prerouting
      
         - smc: forward wakeup to smc socket waitqueue after fallback
      
         - ieee802154:
            - return meaningful error codes from the netlink helpers
            - mcr20a: fix lifs/sifs periods
            - at86rf230, ca8210: stop leaking skbs on error paths
      
         - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
      
         - ax25: add refcount in ax25_dev to avoid UAF bugs
      
         - eth: mlx5e:
            - fix SFP module EEPROM query
            - fix broken SKB allocation in HW-GRO
            - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
      
         - eth: amd-xgbe:
            - fix skb data length underflow
            - ensure reset of the tx_timer_active flag, avoid Tx timeouts
      
         - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
      
         - eth: e1000e: handshake with CSME starts from Alder Lake platforms"
      
      * tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
        ax25: fix reference count leaks of ax25_dev
        net: stmmac: ensure PTP time register reads are consistent
        net: ipa: request IPA register values be retained
        dt-bindings: net: qcom,ipa: add optional qcom,qmp property
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
        tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
        net: sparx5: do not refer to skb after passing it on
        Partially revert "net/smc: Add netlink net namespace support"
        net/mlx5e: Avoid field-overflowing memcpy()
        net/mlx5e: Use struct_group() for memcpy() region
        net/mlx5e: Avoid implicit modify hdr for decap drop rule
        net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
        net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
        net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
        net/mlx5: E-Switch, Fix uninitialized variable modact
        net/mlx5e: Fix handling of wrong devices during bond netevent
        net/mlx5e: Fix broken SKB allocation in HW-GRO
        net/mlx5e: Fix wrong calculation of header index in HW_GRO
        ...
      eb2eb516
    • Linus Torvalds's avatar
      Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 551007a8
      Linus Torvalds authored
      Pull selinux fix from Paul Moore:
       "One small SELinux patch to ensure that a policy structure field is
        properly reset after freeing so that we don't inadvertently do a
        double-free on certain error conditions"
      
      * tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix double free of cond_list on error paths
      551007a8
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of... · 25b20ae8
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Important fixes to several tests and documentation clarification on
        running mainline kselftest on stable releases. A few notable fixes:
      
         - fix kselftest run hang due to child processes that haven't been
           terminated. Fix signals all child processes
      
         - fix false pass/fail results from vdso_test_abi, openat2, mincore
      
         - build failures when using -j (multiple jobs) option
      
         - exec test build failure due to incorrect build rule for a run-time
           created "pipe"
      
         - zram test fixes related to interaction with zram-generator to make
           sure zram test to coordinate deleted with zram-generator
      
         - zram test compression ratio calculation fix and skipping
           max_comp_streams.
      
         - increasing rtc test timeout
      
         - cpufreq test to write test results to stdout which will necessary
           on automated test systems"
      
      * tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kselftest: Fix vdso_test_abi return status
        selftests: skip mincore.check_file_mmap when fs lacks needed support
        selftests: openat2: Skip testcases that fail with EOPNOTSUPP
        selftests: openat2: Add missing dependency in Makefile
        selftests: openat2: Print also errno in failure messages
        selftests: futex: Use variable MAKE instead of make
        selftests/exec: Remove pipe from TEST_GEN_FILES
        selftests/zram: Adapt the situation that /dev/zram0 is being used
        selftests/zram01.sh: Fix compression ratio calculation
        selftests/zram: Skip max_comp_streams interface on newer kernel
        docs/kselftest: clarify running mainline tests on stables
        kselftest: signal all child processes
        selftests: cpufreq: Write test output to stdout as well
        selftests: rtc: Increase test timeout so that all tests run
      25b20ae8
    • Duoming Zhou's avatar
      ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.
      
      Fixes: d01ffb9e
      
       ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      87563a04
    • Yannick Vignon's avatar
      net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an about
      1-second error in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
      
      Fixes: 92ba6888
      
       ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: default avatarYannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      80d46090
    • Jakub Kicinski's avatar
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      
      
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      77b1b8b4
    • Mickaël Salaün's avatar
      printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() · 1f2cfdd3
      Mickaël Salaün authored
      The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
      kernel/printk/sysctl.c introduced an incorrect __user attribute to the
      buffer argument.  I spotted this change in [1] as well as the kernel
      test robot.  Revert this change to please sparse:
      
        kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
        kernel/printk/sysctl.c:20:51:    expected void *
        kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer
      
      Fixes: faaa357a
      
       ("printk: move printk sysctl to printk/sysctl.c")
      Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net [1]
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: default avatarMickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f2cfdd3
    • Igor Pylypiv's avatar
      Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec
      
       ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
      
      Signed-off-by: default avatarIgor Pylypiv <ipylypiv@google.com>
      Reviewed-by: default avatarChangyuan Lyu <changyuanl@google.com>
      Reviewed-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67d6212a
    • Linus Torvalds's avatar
      Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 305e6c42
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
      
       - Eric's fix for a long standing cgroup1 permission issue where it only
         checks for uid 0 instead of CAP which inadvertently allows
         unprivileged userns roots to modify release_agent userhelper
      
       - Fixes for the fallout from Waiman's recent cpuset work
      
      * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
        cgroup-v1: Require capabilities to set release_agent
        cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
        cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
      305e6c42
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      
      
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0166556a
    • Alex Elder's avatar
      net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request ondoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d
      
       ("net: ipa: use autosuspend")
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      34a08176
    • Alex Elder's avatar
      dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      
      
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ac62a017
  3. Feb 03, 2022
    • Waiman Long's avatar
      cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning · 2bdfd282
      Waiman Long authored
      It was found that a "suspicious RCU usage" lockdep warning was issued
      with the rcu_read_lock() call in update_sibling_cpumasks().  It is
      because the update_cpumasks_hier() function may sleep. So we have
      to release the RCU lock, call update_cpumasks_hier() and reacquire
      it afterward.
      
      Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
      instead of stating that in the comment.
      
      Fixes: 4716909c
      
       ("cpuset: Track cpusets that use parent's effective_cpus")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Tested-by: default avatarPhil Auld <pauld@redhat.com>
      Reviewed-by: default avatarPhil Auld <pauld@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      2bdfd282
    • Nathan Chancellor's avatar
      tools/resolve_btfids: Do not print any commands when building silently · 7f3bdbc3
      Nathan Chancellor authored
      When building with 'make -s', there is some output from resolve_btfids:
      
      $ make -sj"$(nproc)" oldconfig prepare
        MKDIR     .../tools/bpf/resolve_btfids/libbpf/
        MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
        LINK     resolve_btfids
      
      Silent mode means that no information should be emitted about what is
      currently being done. Use the $(silent) variable from Makefile.include
      to avoid defining the msg macro so that there is no information printed.
      
      Fixes: fbbb68de
      
       ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org
      7f3bdbc3
    • John Hubbard's avatar
      Revert "mm/gup: small refactoring: simplify try_grab_page()" · c36c04c2
      John Hubbard authored
      This reverts commit 54d516b1
      
      That commit did a refactoring that effectively combined fast and slow
      gup paths (again).  And that was again incorrect, for two reasons:
      
       a) Fast gup and slow gup get reference counts on pages in different
          ways and with different goals: see Linus' writeup in commit
          cd1adf1b ("Revert "mm/gup: remove try_get_page(), call
          try_get_compound_head() directly""), and
      
       b) try_grab_compound_head() also has a specific check for
          "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
          can fall back to slow gup. This resulted in new failures, as
          recently report by Will McVicker [1].
      
      But (a) has problems too, even though they may not have been reported
      yet.  So just revert this.
      
      Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
      Fixes: 54d516b1
      
       ("mm/gup: small refactoring: simplify try_grab_page()")
      Reported-and-tested-by: default avatarWill McVicker <willmcvicker@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: stable@vger.kernel.org # 5.15
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c36c04c2
    • Linus Torvalds's avatar
      Merge tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · d394bb77
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - fix missed change for PTR->PTR_WD conversion
      
       - kernel-doc fixes
      
      * tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: KVM: fix vz.c kernel-doc notation
        MIPS: octeon: Fix missed PTR->PTR_WD conversion
      d394bb77
    • Hou Tao's avatar
      bpf: Use VM_MAP instead of VM_ALLOC for ringbuf · b293dcc4
      Hou Tao authored
      After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
      after mapping"), non-VM_ALLOC mappings will be marked as accessible
      in __get_vm_area_node() when KASAN is enabled. But now the flag for
      ringbuf area is VM_ALLOC, so KASAN will complain out-of-bound access
      after vmap() returns. Because the ringbuf area is created by mapping
      allocated pages, so use VM_MAP instead.
      
      After the change, info in /proc/vmallocinfo also changes from
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
      to
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user
      
      Fixes: 457f4436
      
       ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: default avatar <syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com>
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
      b293dcc4
    • Ryan Bair's avatar
      cifs: fix workstation_name for multiuser mounts · d3b331fb
      Ryan Bair authored
      Set workstation_name from the master_tcon for multiuser mounts.
      
      Just in case, protect size_of_ntlmssp_blob against a NULL workstation_name.
      
      Fixes: 49bd49f9
      
       ("cifs: send workstation name during ntlmssp session setup")
      Cc: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarPaulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: default avatarRyan Bair <ryandbair@gmail.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      d3b331fb
    • Rohith Surabattula's avatar
      Invalidate fscache cookie only when inode attributes are changed. · 40c845c1
      Rohith Surabattula authored
      
      
      For example if mtime or size has changed.
      
      Signed-off-by: default avatarRohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: default avatarShyam Prasad N <sprasad@microsoft.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      40c845c1
    • Daniel Borkmann's avatar
      net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry turned to NUD_FAILED state, where __neigh_event_send()
      triggered an immediate probe as per commit cd28ca0a ("neigh: reduce
      arp latency") via neigh_probe() given table lock was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler() similarly as pre cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given this only happens on
      actual failure state and regular / expected state is NUD_VALID with the
      entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384
      
       ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: default avatar <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: default avatar <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4a81f6da
    • Eric Dumazet's avatar
      tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data() · b67985be
      Eric Dumazet authored
      tcp_shift_skb_data() might collapse three packets into a larger one.
      
      P_A, P_B, P_C  -> P_ABC
      
      Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
      because it was enough.
      
      In commit 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions"),
      this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
      
      But the now needed test over P_C has been missed.
      
      This probably broke MPTCP.
      
      Then later, commit 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      added an extra condition to tcp_skb_can_collapse(), but the missing call
      from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
      might have different skb_zcopy_pure() status.
      
      Fixes: 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions")
      Fixes: 9b65b17d
      
       ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b67985be
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · 88808fbb
      Linus Torvalds authored
      Pull nfsd fixes from Chuck Lever:
       "Notable bug fixes:
      
         - Ensure SM_NOTIFY doesn't crash the NFS server host
      
         - Ensure NLM locks are cleaned up after client reboot
      
         - Fix a leak of internal NFSv4 lease information"
      
      * tag 'nfsd-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client.
        lockd: fix failure to cleanup client locks
        lockd: fix server crash on reboot of client holding lock
      88808fbb
    • Linus Torvalds's avatar
      Merge tag 'fsnotify_for_v5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · d5084ffb
      Linus Torvalds authored
      Pull fanotify fix from Jan Kara:
       "Fix stale file descriptor in copy_event_to_user"
      
      * tag 'fsnotify_for_v5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        fanotify: Fix stale file descriptor in copy_event_to_user()
      d5084ffb
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-kunit-fixes-5.17-rc3' of... · 27bb0b18
      Linus Torvalds authored
      Merge tag 'linux-kselftest-kunit-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit fixes from Shuah Khan:
       "A single fix to an error seen on qemu due to a missing import"
      
      * tag 'linux-kselftest-kunit-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: tool: Import missing importlib.abc
      27bb0b18
    • Ilya Dryomov's avatar
      libceph: optionally use bounce buffer on recv path in crc mode · 038b8d1d
      Ilya Dryomov authored
      
      
      Both msgr1 and msgr2 in crc mode are zero copy in the sense that
      message data is read from the socket directly into the destination
      buffer.  We assume that the destination buffer is stable (i.e. remains
      unchanged while it is being read to) though.  Otherwise, CRC errors
      ensue:
      
        libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706
        libceph: osd1 (1)192.168.122.1:6843 bad crc/signature
      
        libceph: bad data crc, calculated 57958023, expected 1805382778
        libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc
      
      Introduce rxbounce option to enable use of a bounce buffer when
      receiving message data.  In particular this is needed if a mapped
      image is a Windows VM disk, passed to QEMU.  Windows has a system-wide
      "dummy" page that may be mapped into the destination buffer (potentially
      more than once into the same buffer) by the Windows Memory Manager in
      an effort to generate a single large I/O [1][2].  QEMU makes a point of
      preserving overlap relationships when cloning I/O vectors, so krbd gets
      exposed to this behaviour.
      
      [1] "What Is Really in That MDL?"
          https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)
      [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/
      
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      038b8d1d