  1. Aug 25, 2023
    • madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check · 20b18aad
      Yin Fengwei authored
      Commit fc986a38 ("mm: huge_memory: convert madvise_free_huge_pmd to
      use a folio") replaced page_mapcount() with folio_mapcount() to check
      whether the folio is shared by another mapping.
      
      That is not correct for large folios: folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  The estimate is not 100% correct, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise(), whereas the correct behavior is for the THP to be split and
      then processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long-term fix from David is ready.
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-3-fengwei.yin@intel.com
      Fixes: fc986a38 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check · 2f406263
      Yin Fengwei authored
      
      Patch series "don't use mapcount() to check large folio sharing", v2.
      
      In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
      folio_mapcount() is used to check whether the folio is shared.  But that
      is not correct, as folio_mapcount() returns the total mapcount of a large
      folio.
      
      Use folio_estimated_sharers() here, as the estimated number is enough.
      
      This patchset fixes the following case: a user space application calls
      madvise() with MADV_FREE, MADV_COLD or MADV_PAGEOUT for a specific
      address range, and there are THPs mapped into that range.  Without the
      patchset, the THP is skipped.  With the patchset, the THP is split and
      handled accordingly.
      
      David reported that the cow selftest skips some cases because
      MADV_PAGEOUT skips THPs:
      https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.com/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
      and I confirmed that this patchset makes it work again.
      
      
      This patch (of 3):
      
      Commit 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range()
      to use folios") replaced page_mapcount() with folio_mapcount() to
      check whether the folio is shared by another mapping.
      
      That is not correct for large folios: folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  The estimate is not 100% correct, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise(), whereas the correct behavior is for the THP to be split and
      then processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long-term fix from David is ready.
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
      Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
      Fixes: 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
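
The difference between the two sharing checks described in the madvise commits above can be seen in a small user-space model. This is only a sketch: `toy_folio`, `toy_total_mapcount()` and `toy_estimated_sharers()` are hypothetical stand-ins rather than the kernel's structures or helpers, and the model assumes that the estimate simply looks at the map count of the first page of the folio.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a large folio: one map count per page. */
struct toy_folio {
	size_t nr_pages;
	int page_mapcount[8];
};

/* Models the "total mapcount" idea: the sum over all pages of the folio. */
static int toy_total_mapcount(const struct toy_folio *folio)
{
	int total = 0;

	for (size_t i = 0; i < folio->nr_pages; i++)
		total += folio->page_mapcount[i];
	return total;
}

/* Models the estimate (assumption): only the first page is inspected. */
static int toy_estimated_sharers(const struct toy_folio *folio)
{
	return folio->page_mapcount[0];
}

/* Old-style check: treats any total other than 1 as "shared". */
static bool toy_skip_with_total(const struct toy_folio *folio)
{
	return toy_total_mapcount(folio) != 1;
}

/* New-style check: treats an estimate other than 1 as "shared". */
static bool toy_skip_with_estimate(const struct toy_folio *folio)
{
	return toy_estimated_sharers(folio) != 1;
}
```

In this model, a four-page folio that a single process maps page by page has a total of 4, so the total-based check would wrongly skip it, while the estimate-based check would not.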
  2. Aug 22, 2023
    • mm: multi-gen LRU: don't spin during memcg release · 6867c7a3
      T.J. Mercier authored
      When a memcg is in the process of being released mem_cgroup_tryget will
      fail because its reference count has already reached 0.  This can happen
      during reclaim if the memcg has already been offlined, and we reclaim all
      remaining pages attributed to the offlined memcg.  shrink_many attempts to
      skip the empty memcg in this case, and continue reclaiming from the
      remaining memcgs in the old generation.  If there is only one memcg
      remaining, or if all remaining memcgs are in the process of being released
      then shrink_many will spin until all memcgs have finished being released. 
      The release occurs through a workqueue, so it can take a while before
      kswapd is able to make any further progress.
      
      This fix results in reductions in kswapd activity and direct reclaim in
      a test where 28 apps (working set size > total memory) are repeatedly
      launched in a random sequence:
      
                                             A          B      delta   ratio(%)
                 allocstall_movable       5962       3539      -2423     -40.64
                  allocstall_normal       2661       2417       -244      -9.17
      kswapd_high_wmark_hit_quickly      53152       7594     -45558     -85.71
                         pageoutrun      57365      11750     -45615     -79.52
      
      Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: T.J. Mercier <tjmercier@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: fix unexpected return value in soft_offline_page() · e2c1ab07
      Miaohe Lin authored
      When page_handle_poison() fails to handle the hugepage or free page in
      retry path, soft_offline_page() will return 0 while -EBUSY is expected in
      this case.
      
      Consequently the user will think soft_offline_page() succeeded while it
      in fact failed, and so will not try again later in this case.
      
      Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
      Fixes: b94e0282 ("mm,hwpoison: try to narrow window race for free pages")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
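
The return-value problem in the soft_offline_page() commit above can be illustrated with a toy retry loop. This is a sketch under assumptions, not the kernel function: `toy_soft_offline()` and the `toy_handle_fn` callback (standing in for page_handle_poison()) are hypothetical names, and the single-retry structure is assumed for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callback standing in for page_handle_poison(). */
typedef bool (*toy_handle_fn)(int attempt);

static int toy_soft_offline(toy_handle_fn handle)
{
	bool try_again = true;
	int attempt = 0;

retry:
	if (handle(attempt))
		return 0;		/* handled: genuine success */
	if (try_again) {
		try_again = false;
		attempt++;
		goto retry;		/* one more attempt */
	}
	/* The retry failed as well: report it instead of returning 0. */
	return -EBUSY;
}

/* Two sample handlers for exercising the function. */
static bool toy_always_fail(int attempt)
{
	(void)attempt;
	return false;
}

static bool toy_second_try_ok(int attempt)
{
	return attempt == 1;
}
```

Calling `toy_soft_offline(toy_always_fail)` yields `-EBUSY`, so a caller can tell that it should try again later, while `toy_soft_offline(toy_second_try_ok)` yields 0.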
    • radix tree: remove unused variable · d59070d1
      Arnd Bergmann authored
      Recent versions of clang warn about an unused variable, though older
      versions saw the 'slot++' as a use and did not warn:
      
      radix-tree.c:1136:50: error: parameter 'slot' set but not used [-Werror,-Wunused-but-set-parameter]
      
      It's clearly not needed any more, so just remove it.
      
      Link: https://lkml.kernel.org/r/20230811131023.2226509-1-arnd@kernel.org
      Fixes: 3a08cd52 ("radix tree: Remove multiorder support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: Rong Tao <rongtao@cestc.cn>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add a call to flush_cache_vmap() in vmap_pfn() · a50420c7
      Alexandre Ghiti authored
      flush_cache_vmap() must be called after new vmalloc mappings are installed
      in the page table in order to allow architectures to make sure the new
      mapping is visible.
      
      Omitting this call could lead to a panic, since on some architectures
      (like powerpc) the page table walker could see the wrong pte value and
      trigger a spurious page fault that cannot be resolved (see commit
      f1cb8f9b ("powerpc/64s/radix: avoid ptesync after set_pte and
      ptep_set_access_flags")).
      
      But actually the patch is aiming at riscv: the riscv specification
      allows the caching of invalid entries in the TLB, and since we recently
      removed the vmalloc page fault handling, we now need to emit a tlb
      shootdown whenever a new vmalloc mapping is emitted
      (https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/).
      That's a temporary solution, there are ways to avoid that :)
      
      Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com
      Fixes: 3e9a9e25 ("mm: add a vmap_pfn function")
      Reported-by: Dylan Jhong <dylan@andestech.com>
      Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/
      Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
      Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
      Reviewed-by: Dylan Jhong <dylan@andestech.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: FOLL_LONGTERM need to be updated to 0x100 · 1738b949
      Ayush Jain authored
      After commit 2c224108 ("mm/gup: move private gup FOLL_ flags to
      internal.h") FOLL_LONGTERM flag value got updated from 0x10000 to 0x100 at
      include/linux/mm_types.h.
      
      As hmm.hmm_device_private.hmm_gup_test uses FOLL_LONGTERM, update the
      value here as well.
      
      Before this change, the test goes into an infinite assert loop in
      hmm.hmm_device_private.hmm_gup_test:
      ==========================================================
       RUN           hmm.hmm_device_private.hmm_gup_test ...
      hmm-tests.c:1962:hmm_gup_test:Expected HMM_DMIRROR_PROT_WRITE..
      ..(2) == m[2] (34)
      hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0)
      hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0)
      ...
      ==========================================================
      
       Call Trace:
       <TASK>
       ? sched_clock+0xd/0x20
       ? __lock_acquire.constprop.0+0x120/0x6c0
       ? ktime_get+0x2c/0xd0
       ? sched_clock+0xd/0x20
       ? local_clock+0x12/0xd0
       ? lock_release+0x26e/0x3b0
       pin_user_pages_fast+0x4c/0x70
       gup_test_ioctl+0x4ff/0xbb0
       ? gup_test_ioctl+0x68c/0xbb0
       __x64_sys_ioctl+0x99/0xd0
       do_syscall_64+0x60/0x90
       ? syscall_exit_to_user_mode+0x2a/0x50
       ? do_syscall_64+0x6d/0x90
       ? syscall_exit_to_user_mode+0x2a/0x50
       ? do_syscall_64+0x6d/0x90
       ? irqentry_exit_to_user_mode+0xd/0x20
       ? irqentry_exit+0x3f/0x50
       ? exc_page_fault+0x96/0x200
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
       RIP: 0033:0x7f6aaa31aaff
      
      After this change, the test passes.
      
      Link: https://lkml.kernel.org/r/20230808124347.79163-1-ayush.jain3@amd.com
      Fixes: 2c224108 ("mm/gup: move private gup FOLL_ flags to internal.h")
      Signed-off-by: Ayush Jain <ayush.jain3@amd.com>
      Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
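
The FOLL_LONGTERM commit above is an instance of a duplicated constant going stale. A minimal sketch, using only the two values quoted in that commit message: the `TOY_` names and `toy_is_longterm()` are hypothetical, and the sketch only shows that a consumer checking the new bit no longer recognises the old one. It does not model what the kernel actually does with the stale bit.

```c
/* Values quoted in the commit message; the TOY_ names are hypothetical. */
#define TOY_FOLL_LONGTERM_OLD 0x10000u	/* stale copy kept by the test */
#define TOY_FOLL_LONGTERM_NEW 0x100u	/* value after the flag was moved */

/* A consumer that has been updated to test the new bit position. */
static int toy_is_longterm(unsigned int gup_flags)
{
	return (gup_flags & TOY_FOLL_LONGTERM_NEW) != 0;
}
```

Passing `TOY_FOLL_LONGTERM_OLD` to `toy_is_longterm()` returns 0, which is why a test that keeps its own copy of the flag value has to be updated together with the header.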
    • nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() · f83913f8
      Ryusuke Konishi authored
      
      
      A syzbot stress test reported that create_empty_buffers() called from
      nilfs_lookup_dirty_data_buffers() can cause a general protection fault.
      
      Analysis using its reproducer revealed that the back reference "mapping"
      from a page/folio has been changed to NULL after dirty page/folio gang
      lookup in nilfs_lookup_dirty_data_buffers().
      
      Fix this issue by excluding pages/folios from being collected if, after
      acquiring a lock on each page/folio, its back reference "mapping" differs
      from the pointer to the address space struct that held the page/folio.
      
      Link: https://lkml.kernel.org/r/20230805132038.6435-1-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+0ad741797f4565e7e2d2@syzkaller.appspotmail.com>
      Closes: https://lkml.kernel.org/r/0000000000002930a705fc32b231@google.com
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
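
The check described in the nilfs2 commit above (drop a page/folio whose back reference no longer matches the address space it was looked up from) can be sketched in user space as follows. `toy_mapping`, `toy_folio` and `toy_collect_dirty()` are hypothetical names, and locking is reduced to a comment. This models the idea and is not the nilfs2 code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for an address space and a page/folio. */
struct toy_mapping { int unused; };
struct toy_folio { const struct toy_mapping *mapping; };

/*
 * Copy into 'out' only those entries of 'batch' that still belong to
 * 'owner'.  Returns the number of entries kept.
 */
static size_t toy_collect_dirty(struct toy_folio **batch, size_t n,
				const struct toy_mapping *owner,
				struct toy_folio **out)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_folio *folio = batch[i];

		/* The real code would hold the page/folio lock for this check. */
		if (folio->mapping != owner)
			continue;	/* detached since the lookup: skip it */
		out[kept++] = folio;
	}
	return kept;
}
```

An entry whose `mapping` has become NULL after the lookup is skipped, so later steps never touch it.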
    • mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast · 5805192c
      David Hildenbrand authored
      In contrast to most other GUP code, GUP-fast common page table walking
      code like gup_pte_range() also handles hugetlb pages.  But in contrast to
      other hugetlb page table walking code, it does not look at the hugetlb PTE
      abstraction whereby we have only a single logical hugetlb PTE per hugetlb
      page, even when using multiple cont-PTEs underneath -- which is for
      example what huge_ptep_get() abstracts.
      
      So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast
      might stumble over a PTE that does not map the head page of a hugetlb page
      -- not the first "head" PTE of such a cont mapping.
      
      Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we
      might end up calling gup_must_unshare() with a tail page of a hugetlb
      page.
      
      We only maintain a single PageAnonExclusive flag per hugetlb page (as
      hugetlb pages cannot get partially COW-shared), stored for the head page. 
      That flag is clear for all tail pages.
      
      So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail
      page of a hugetlb page:
      
      1) With CONFIG_DEBUG_VM_PGFLAGS
      
      Stumbles over the:
      
      	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
      
      For example, when executing the COW selftests with 64k hugetlb pages on
      arm64:
      
        [   61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11
        [   61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2
        [   61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff)
        [   61.084101] page_type: 0xffffffff()
        [   61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000
        [   61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
        [   61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71
        [   61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000
        [   61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page))
        [   61.086914] ------------[ cut here ]------------
        [   61.087220] kernel BUG at include/linux/page-flags.h:990!
        [   61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
        [   61.087999] Modules linked in: ...
        [   61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3
        [   61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
        [   61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        [   61.090897] pc : gup_must_unshare.part.0+0x64/0x98
        [   61.091242] lr : gup_must_unshare.part.0+0x64/0x98
        [   61.091592] sp : ffff8000825eb940
        [   61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440
        [   61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000
        [   61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58
        [   61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff
        [   61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
        [   61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
        [   61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0
        [   61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8
        [   61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000
        [   61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046
        [   61.096894] Call trace:
        [   61.097080]  gup_must_unshare.part.0+0x64/0x98
        [   61.097392]  gup_pte_range+0x3a8/0x3f0
        [   61.097662]  gup_pgd_range+0x1ec/0x280
        [   61.097942]  lockless_pages_from_mm+0x64/0x1a0
        [   61.098258]  internal_get_user_pages_fast+0xe4/0x1d0
        [   61.098612]  pin_user_pages_fast+0x58/0x78
        [   61.098917]  pin_longterm_test_start+0xf4/0x2b8
        [   61.099243]  gup_test_ioctl+0x170/0x3b0
        [   61.099528]  __arm64_sys_ioctl+0xa8/0xf0
        [   61.099822]  invoke_syscall.constprop.0+0x7c/0xd0
        [   61.100160]  el0_svc_common.constprop.0+0xe8/0x100
        [   61.100500]  do_el0_svc+0x38/0xa0
        [   61.100736]  el0_svc+0x3c/0x198
        [   61.100971]  el0t_64_sync_handler+0x134/0x150
        [   61.101280]  el0t_64_sync+0x17c/0x180
        [   61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000)
      
      2) Without CONFIG_DEBUG_VM_PGFLAGS
      
      Always detects "not exclusive" for passed tail pages and refuses to PIN
      the tail pages R/O, as gup_must_unshare() == true.  GUP-fast will fallback
      to ordinary GUP.  As ordinary GUP properly considers the logical hugetlb
      PTE abstraction in hugetlb_follow_page_mask(), pinning the page will
      succeed when looking at the PageAnonExclusive on the head page only.
      
      So the only real effect of this is that with cont-PTE hugetlb pages, we'll
      always fall back from GUP-fast to ordinary GUP when not working on the head
      page, which ends up checking the head page and doing the right thing.
      
      Consequently, the cow selftests pass with cont-PTE hugetlb pages as well
      without CONFIG_DEBUG_VM_PGFLAGS.
      
      Note that this only applies to anon hugetlb pages that are mapped using
      cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel.
      
      ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into
      the page table R/O using GUP-fast.
      
      On production kernels (and even most debug kernels, that don't set
      CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required
      to be backported.  But of course, it does not hurt.
      
      Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com
      Fixes: a7f22660 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
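
Since the cont-PTE commit above states that the exclusive flag is kept only on the head page of a hugetlb page, one way to picture a correct check is to resolve a tail page to its head before reading the flag. The following user-space sketch assumes exactly that. `toy_page`, `toy_anon_exclusive()` and `toy_must_unshare()` are hypothetical and much simpler than the kernel's page flags.

```c
#include <stdbool.h>

/* Hypothetical page model: every page knows the head of its compound page. */
struct toy_page {
	struct toy_page *head;	/* points to itself for a head page */
	bool is_hugetlb;
	bool anon_exclusive;	/* only meaningful on the head for hugetlb */
};

/* Read the exclusive flag, resolving hugetlb tail pages to their head. */
static bool toy_anon_exclusive(const struct toy_page *page)
{
	if (page->is_hugetlb && page->head != page)
		page = page->head;
	return page->anon_exclusive;
}

/* Unsharing is needed only if the page is not exclusive. */
static bool toy_must_unshare(const struct toy_page *page)
{
	return !toy_anon_exclusive(page);
}
```

With the head lookup in place, a tail page of an exclusive hugetlb page is reported as exclusive too, instead of always reading as "not exclusive".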
    • selftests: cgroup: fix test_kmem_basic less than error · 60439471
      Lucas Karpinski authored
      
      
      test_kmem_basic creates 100,000 negative dentries, with each one mapping
      to a slab object.  After memory.high is set, these are reclaimed through
      the shrink_slab function call which reclaims all 100,000 entries.  The
      test passes the majority of the time because when slab1 or current is
      calculated, it is often above 0; however, 0 is also an acceptable value.
      
      Link: https://lkml.kernel.org/r/7d6gcuyzdjcice6qbphrmpmv5skr5jtglg375unnjxqhstvhxc@qkn6dw6bao6v
      Signed-off-by: Lucas Karpinski <lkarpins@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
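
The test_kmem_basic commit above comes down to which comparison counts as a failed reading. A minimal sketch, assuming that a negative value signals a failed read: `toy_reading_ok()` is a hypothetical helper, not part of the selftest.

```c
#include <stdbool.h>

/*
 * Accept any non-negative reading.  Zero is valid here: it simply means
 * everything was reclaimed.  Only negative values count as errors
 * (assumption for this sketch).
 */
static bool toy_reading_ok(long value)
{
	return value >= 0;
}
```

A check written as `value <= 0` would instead flag the fully reclaimed case as an error, which matches the occasional failure the commit describes.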
    • mm: enable page walking API to lock vmas during the walk · 49b06385
      Suren Baghdasaryan authored
      
      
      walk_page_range() and friends often operate under write-locked mmap_lock. 
      With introduction of vma locks, the vmas have to be locked as well during
      such walks to prevent concurrent page faults in these areas.  Add an
      additional member to mm_walk_ops to indicate locking requirements for the
      walk.
      
      The change ensures that page walks which prevent concurrent page faults
      by write-locking mmap_lock, operate correctly after introduction of
      per-vma locks.  With per-vma locks page faults can be handled under vma
      lock without taking mmap_lock at all, so write locking mmap_lock would
      not stop them.  The change ensures vmas are properly locked during such
      walks.
      
      A sample issue this solves is do_mbind() performing queue_pages_range()
      to queue pages for migration.  Without this change a concurrent page
      can be faulted into the area and be left out of migration.
      
      Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Suggested-by: Jann Horn <jannh@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
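
The page-walk commit above adds a member to the walk operations that states the locking requirement for the walk. The following user-space sketch shows the shape of that idea only. `toy_walk_lock`, `toy_walk_ops`, `toy_vma` and `toy_walk()` are hypothetical names, and "locking" a vma is reduced to setting a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical locking requirement carried by the walk operations. */
enum toy_walk_lock {
	TOY_WALK_RDLOCK,	/* walk does not need to exclude page faults */
	TOY_WALK_WRLOCK,	/* walk must lock each vma it visits */
};

struct toy_vma {
	bool write_locked;
};

struct toy_walk_ops {
	enum toy_walk_lock walk_lock;
};

/* Visit every vma, locking it first when the ops ask for it. */
static size_t toy_walk(struct toy_vma *vmas, size_t n,
		       const struct toy_walk_ops *ops)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		if (ops->walk_lock == TOY_WALK_WRLOCK)
			vmas[i].write_locked = true;	/* stands in for a vma write lock */
		visited++;
	}
	return visited;
}
```

Carrying the requirement in the ops structure means each user of the walk declares once whether concurrent page faults must be kept out, and the walker applies it to every vma it touches.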
    • smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() · 8b9c1cc0
      David Hildenbrand authored
      We shouldn't be using a GUP-internal helper if it can be avoided.
      
      Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
      vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
      
      In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
      
      (1) Will always return the head page, not a tail page of a THP.
      
       If we'd ever call smaps_account with a tail page while setting "compound
       = true", we could be in trouble, because smaps_account() would look at
       the memmap of unrelated pages.
      
       If we're unlucky, that memmap does not exist at all. Before we removed
       PG_doublemap, we could have triggered something similar as in
       commit 24d7275c ("fs/proc: task_mmu.c: don't read mapcount for
       migration entry").
      
       This can theoretically happen ever since commit ff9f47f6 ("mm: proc:
       smaps_rollup: do not stall write attempts on mmap_lock"):
      
        (a) We're in show_smaps_rollup() and processed a VMA
        (b) We release the mmap lock in show_smaps_rollup() because it is
            contended
        (c) We merged that VMA with another VMA
        (d) We collapsed a THP in that merged VMA at that position
      
       If the end address of the original VMA falls into the middle of a THP
       area, we would call smap_gather_stats() with a start address that falls
       into a PMD-mapped THP. It's probably very rare to trigger when not
       really forced.
      
      (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
      
       Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
       If such pages would be anonymous, we most certainly would want to
       account them.
      
      (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
      
       As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
       pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
       entries.
      
       Especially, follow_pmd_mask() never ends up calling
       follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
       follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
       is set.
      
       So skipping pmd_devmap() pages seems to be the right thing to do.
      
      (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
      
       We won't be returning a memmap that should be ignored by core-mm, or
       worse, a memmap that does not even exist. Note that while
       walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
      
       Most probably this case doesn't currently really happen on the PMD level,
       otherwise we'd already be able to trigger kernel crashes when reading
       smaps / smaps_rollup.
      
      So most probably only (1) is relevant in practice as of now, but could only
      cause trouble in extreme corner cases.
      
      Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
      reuse in wrong context.
      
      Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
      Fixes: ff9f47f6 ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: liubo <liubo254@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT · d74943a2
      David Hildenbrand authored
      Unfortunately commit 474098ed ("mm/gup: replace FOLL_NUMA by
      gup_can_follow_protnone()") missed that follow_page() and
      follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
      don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
      or due to inaccessible (PROT_NONE) VMAs.
      
      As spelled out in commit 0b9d7052 ("mm: numa: Support NUMA hinting
      page faults from gup/gup_fast"): "Other follow_page callers like KSM
      should not use FOLL_NUMA, or they would fail to get the pages if they use
      follow_page instead of get_user_pages."
      
      liubo reported [1] that smaps_rollup results are imprecise, because they
      miss accounting of pages that are mapped PROT_NONE.  Further, it's easy to
      reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
      pte_protnone()/pmd_protnone() also indicates "true" in inaccessible VMAs,
      and follow_page() refuses to return such pages right now.
      
      As KVM really depends on these NUMA hinting faults, removing the
      pte_protnone()/pmd_protnone() handling in GUP code completely is not
      really an option.
      
      To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
      to restore the original behavior for now and add better comments.
      
      Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
      is_valid_gup_args(), to add that flag for all external GUP users.
      
      Note that there are three GUP-internal __get_user_pages() users that don't
      end up calling is_valid_gup_args() and consequently won't get
      FOLL_HONOR_NUMA_FAULT set.
      
      1) get_dump_page(): we really don't want to handle NUMA hinting
         faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
         hinting faults already.
      2) populate_vma_page_range(): we really don't want to handle NUMA hinting
         faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
         honored NUMA hinting faults already.
      3) faultin_vma_page_range(): we similarly don't want to handle NUMA
         hinting faults.
      
      To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
      inaccessible VMAs properly, we have to perform VMA accessibility checks in
      gup_can_follow_protnone().
      
      As GUP-fast should reject such pages either way in
      pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
      arm64 that both implement pte_protnone() -- let's just always fallback to
      ordinary GUP when stumbling over pte_protnone()/pmd_protnone().
      
      As Linus notes [2], honoring NUMA faults might only make sense for
      selected GUP users.
      
      So we should really see if we can instead let relevant GUP callers specify
      it manually, and not trigger NUMA hinting faults from GUP as default. 
      Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
      adding appropriate documentation.
      
      While at it, remove a stale comment from follow_trans_huge_pmd(): That
      comment for pmd_protnone() was added in commit 2b4847e7 ("mm: numa:
      serialise parallel get_user_page against THP migration"), which noted:
      
      	THP does not unmap pages due to a lack of support for migration
      	entries at a PMD level.  This allows races with get_user_pages
      
      Nowadays, we do have PMD migration entries, so the comment no longer
      applies.  Let's drop it.
      
      [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
      [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com
      Fixes: 474098ed ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: liubo <liubo254@huawei.com>
      Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
      Reported-by: Peter Xu <peterx@redhat.com>
      Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d74943a2
  3. Aug 05, 2023
    • SeongJae Park's avatar
      mm/damon/core: initialize damo_filter->list from damos_new_filter() · 5f1fc67f
      SeongJae Park authored
      damos_new_filter() does not initialize the list field of the newly
      allocated filter object, and neither the DAMON sysfs interface nor
      DAMON_RECLAIM initializes it after calling damos_new_filter().  As a
      result, accessing uninitialized memory is possible.  In practice, adding
      multiple DAMOS filters via the DAMON sysfs interface caused a NULL pointer
      dereference.  Initialize the field right after the allocation in
      damos_new_filter().
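A userspace sketch of the pattern the fix applies: initialize the embedded list node in the constructor so callers never see an uninitialized node. `struct filter_model`, `new_filter()` and `init_list_head()` are hypothetical stand-ins, not the DAMON definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list node points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

/* Hypothetical, trimmed-down filter object for illustration only. */
struct filter_model {
	int type;
	struct list_head list;
};

/* Allocate a filter and initialize its list node before returning it. */
struct filter_model *new_filter(int type)
{
	struct filter_model *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->type = type;
	init_list_head(&f->list); /* the step the constructor was missing */
	return f;
}
```

Doing the initialization inside the constructor means every caller gets it, rather than each caller having to remember it.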
      
      Link: https://lkml.kernel.org/r/20230729203733.38949-2-sj@kernel.org
      Fixes: 98def236 ("mm/damon/core: implement damos filter")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5f1fc67f
    • Ryusuke Konishi's avatar
      nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput · f8654743
      Ryusuke Konishi authored
      During unmount process of nilfs2, nothing holds nilfs_root structure after
      nilfs2 detaches its writer in nilfs_detach_log_writer().  Previously,
      nilfs_evict_inode() could cause use-after-free read for nilfs_root if
      inodes are left in "garbage_list" and released by nilfs_dispose_list at
      the end of nilfs_detach_log_writer(), and this bug was fixed by commit
      9b5a04ac ("nilfs2: fix use-after-free bug of nilfs_root in
      nilfs_evict_inode()").
      
      However, it turned out that there is another possibility of UAF in the
      call path where mark_inode_dirty_sync() is called from iput():
      
      nilfs_detach_log_writer()
        nilfs_dispose_list()
          iput()
            mark_inode_dirty_sync()
              __mark_inode_dirty()
                nilfs_dirty_inode()
                  __nilfs_mark_inode_dirty()
                    nilfs_load_inode_block() --> causes UAF of nilfs_root struct
      
      This can happen after commit 0ae45f63 ("vfs: add support for a
      lazytime mount option"), which changed iput() to call
      mark_inode_dirty_sync() on its final reference if i_state has I_DIRTY_TIME
      flag and i_nlink is non-zero.
      
      This issue appears after commit 28a65b49 ("nilfs2: do not write dirty
      data after degenerating to read-only") when using the syzbot reproducer,
      but the issue has potentially existed before.
      
      Fix this issue by adding a "purging flag" to the nilfs structure, setting
      that flag while disposing the "garbage_list" and checking it in
      __nilfs_mark_inode_dirty().
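The purging-flag approach can be modeled roughly as follows. This is a simplified userspace sketch under stated assumptions: `struct nilfs_model` and both functions are hypothetical, and the real code paths involve considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-filesystem nilfs structure. */
struct nilfs_model {
	bool purging; /* set only while the garbage list is being disposed */
	int dirtied;  /* counts inodes that were actually marked dirty */
};

/* Models __nilfs_mark_inode_dirty(): skip all work while purging. */
int mark_inode_dirty_model(struct nilfs_model *nilfs)
{
	assert(nilfs != NULL);
	if (nilfs->purging)
		return 0; /* avoid touching state that may already be freed */
	nilfs->dirtied++;
	return 0;
}

/* Models disposing the garbage list: each iput() may try to dirty. */
void dispose_garbage_model(struct nilfs_model *nilfs, int nr_inodes)
{
	assert(nilfs != NULL);
	nilfs->purging = true;
	for (int i = 0; i < nr_inodes; i++)
		mark_inode_dirty_model(nilfs); /* reached via iput() */
	nilfs->purging = false;
}
```

Because the flag is set only around the disposal loop, dirtying outside that window still works.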
      
      Unlike commit 9b5a04ac ("nilfs2: fix use-after-free bug of nilfs_root
      in nilfs_evict_inode()"), this patch does not rely on ns_writer to
      determine whether to skip operations, so as not to break recovery on
      mount.  The nilfs_salvage_orphan_logs routine dirties the buffer of
      salvaged data before attaching the log writer, so changing
      __nilfs_mark_inode_dirty() to skip the operation when ns_writer is NULL
      will cause recovery write to fail.  The purpose of using the cleanup-only
      flag is to allow for narrowing of such conditions.
      
      Link: https://lkml.kernel.org/r/20230728191318.33047-1-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+74db8b3087f293d3a13a@syzkaller.appspotmail.com>
      Closes: https://lkml.kernel.org/r/000000000000b4e906060113fd63@google.com
      Fixes: 0ae45f63 ("vfs: add support for a lazytime mount option")
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.0+
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8654743
    • Johannes Weiner's avatar
      selftests: cgroup: fix test_kmem_basic false positives · fac26502
      Johannes Weiner authored
      
      
      This test fails routinely in our prod testing environment, and I can
      reproduce it locally as well.
      
      The test allocates dcache inside a cgroup, then drops the memory limit
      and checks that usage drops correspondingly. The reason it fails is
      because dentries are freed with an RCU delay - a debugging sleep shows
      that usage drops as expected shortly after.
      
      Insert a 1s sleep after dropping the limit. This should be good
      enough, assuming that machines running those tests are otherwise not
      very busy.
      
      Link: https://lkml.kernel.org/r/20230801135632.1768830-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fac26502
    • Lorenzo Stoakes's avatar
      fs/proc/kcore: reinstate bounce buffer for KCORE_TEXT regions · 17457784
      Lorenzo Stoakes authored
      Some architectures do not populate the entire range categorised by
      KCORE_TEXT, so we must ensure that the kernel address we read from is
      valid.
      
      Unfortunately, a purely iterator-based approach currently offers no way
      to do so, so reinstate the bounce buffer in this instance so we can use
      copy_from_kernel_nofault() to avoid page faults when regions are
      unmapped.
      
      This change partly reverts commit 2e1c0170 ("fs/proc/kcore: avoid
      bounce buffer for ktext data"), reinstating the bounce buffer, but adapts
      the code to continue to use an iterator.
      
      [lstoakes@gmail.com: correct comment to be strictly correct about reasoning]
        Link: https://lkml.kernel.org/r/525a3f14-74fa-4c22-9fca-9dab4de8a0c3@lucifer.local
      Link: https://lkml.kernel.org/r/20230731215021.70911-1-lstoakes@gmail.com
      Fixes: 2e1c0170 ("fs/proc/kcore: avoid bounce buffer for ktext data")
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reported-by: Jiri Olsa <olsajiri@gmail.com>
      Closes: https://lore.kernel.org/all/ZHc2fm+9daF6cgCE@krava
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Will Deacon <will@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      17457784
    • Liam R. Howlett's avatar
      MAINTAINERS: add maple tree mailing list · d1ef9dba
      Liam R. Howlett authored
      
      
      There is a mailing list for the maple tree development.  Add the list to
      the maple tree entry of the MAINTAINERS file so patches will be sent to
      interested parties.
      
      Link: https://lkml.kernel.org/r/20230731175542.1653200-1-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d1ef9dba
    • Johannes Weiner's avatar
      mm: compaction: fix endless looping over same migrate block · 493614da
      Johannes Weiner authored
      During stress testing, the following situation was observed:
      
           70 root      39  19       0      0      0 R 100.0   0.0 959:29.92 khugepaged
       310936 root      20   0   84416  25620    512 R  99.7   1.5 642:37.22 hugealloc
      
      Tracing shows isolate_migratepages_block() endlessly looping over the
      first block in the DMA zone:
      
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_finished: node=0 zone=DMA      order=9 ret=no_suitable_page
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_isolate_migratepages: range=(0x1 ~ 0x400) nr_scanned=513 nr_taken=0
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_finished: node=0 zone=DMA      order=9 ret=no_suitable_page
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_isolate_migratepages: range=(0x1 ~ 0x400) nr_scanned=513 nr_taken=0
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_finished: node=0 zone=DMA      order=9 ret=no_suitable_page
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_isolate_migratepages: range=(0x1 ~ 0x400) nr_scanned=513 nr_taken=0
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_finished: node=0 zone=DMA      order=9 ret=no_suitable_page
             hugealloc-310936  [001] ..... 237297.415718: mm_compaction_isolate_migratepages: range=(0x1 ~ 0x400) nr_scanned=513 nr_taken=0
      
      The problem is that the function tries to test and set the skip bit once
      on the block, to avoid skipping on its own skip-set, using
      pageblock_aligned() on the pfn as a test.  But because this is the DMA
      zone which starts at pfn 1, this is never true for the first block, and
      the skip bit isn't set or tested at all.  As a result,
      fast_find_migrateblock() returns the same pageblock over and over.
      
      If the pfn isn't pageblock-aligned, also check if it's the start of the
      zone to ensure test-and-set-exactly-once on unaligned ranges.
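The condition change can be illustrated with a small userspace sketch. The pageblock size of 512 pages and the function names are assumptions made for this example and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pageblock size for this sketch: 512 pages. */
#define PAGEBLOCK_NR_PAGES 512UL

static bool pageblock_aligned(unsigned long pfn)
{
	return (pfn % PAGEBLOCK_NR_PAGES) == 0;
}

/* Old test: only an aligned pfn counts as the start of a block scan. */
bool at_block_start_old(unsigned long pfn)
{
	return pageblock_aligned(pfn);
}

/* Fixed test: the first pfn of the zone counts too, even if unaligned. */
bool at_block_start_fixed(unsigned long pfn, unsigned long zone_start_pfn)
{
	return pageblock_aligned(pfn) || pfn == zone_start_pfn;
}
```

With a zone starting at pfn 1, the old test never fires for the first block, while the fixed test fires exactly once, at pfn 1.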
      
      Thanks to Vlastimil Babka for the help in debugging this.
      
      Link: https://lkml.kernel.org/r/20230731172450.1632195-1-hannes@cmpxchg.org
      Fixes: 90ed667c ("Revert "Revert "mm/compaction: fix set skip in fast_find_migrateblock""")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      493614da
    • Ayush Jain's avatar
      selftests: mm: ksm: fix incorrect evaluation of parameter · 65294de3
      Ayush Jain authored
      A missing break in ksm_tests leads to a kselftest hang when the parameter
      -s is used.
      
      In the current code flow, because of the missing break in -s, -t parses
      the argument spilled over from -s.  As -t accepts only 0 or 1, any -s
      argument greater than 1 or less than 0 makes ksm_tests fail.
      
      This went undetected because, before option -t was added, the next case
      -M would immediately break out of the switch statement; that is no longer
      the case.
      
      Add the missing break statement.
      
      ----Before----
      ./ksm_tests -H -s 100
      Invalid merge type
      
      ----After----
      ./ksm_tests -H -s 100
      Number of normal pages:    0
      Number of huge pages:    50
      Total size:    100 MiB
      Total time:    0.401732682 s
      Average speed:  248.922 MiB/s
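The fall-through described above can be reproduced in a few lines. This is a simplified sketch, not the ksm_tests source: `struct opts_model` and `handle_option()` are hypothetical, and the accepted merge types (0 and 1) follow the description in the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical subset of the test's parsed options. */
struct opts_model {
	long size_mb;
	int merge_type;
};

/* Returns 0 on success, -1 if the option or its argument is rejected. */
int handle_option(struct opts_model *o, int opt, const char *arg)
{
	assert(o != NULL && arg != NULL);

	switch (opt) {
	case 's':
		o->size_mb = atol(arg);
		if (o->size_mb <= 0)
			return -1;
		break; /* without this, control falls into case 't' */
	case 't': {
		int tmp = atoi(arg);

		if (tmp < 0 || tmp > 1)
			return -1; /* reported as "Invalid merge type" */
		o->merge_type = tmp;
		break;
	}
	default:
		return -1;
	}
	return 0;
}
```

If the `break` after case 's' is removed, `-s 100` is also validated as a merge type and rejected, which matches the "Before" output shown above.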
      
      Link: https://lkml.kernel.org/r/20230728163952.4634-1-ayush.jain3@amd.com
      Fixes: 07115fcc ("selftests/mm: add new selftests for KSM")
      Signed-off-by: Ayush Jain <ayush.jain3@amd.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65294de3
    • Mike Kravetz's avatar
      hugetlb: do not clear hugetlb dtor until allocating vmemmap · 32c87719
      Mike Kravetz authored
      Patch series "Fix hugetlb free path race with memory errors".
      
      In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
      HWPOISON hugepages", a race window was discovered:
      https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
      
      Freeing a hugetlb page back to low level memory allocators is performed
      in two steps.
      1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
      2) Outside lock, allocate vmemmap if necessary and call low level free
      Between these two steps, the hugetlb page will appear as a normal
      compound page.  However, vmemmap for tail pages could be missing.
      If a memory error occurs at this time, we could try to update page
      flags on non-existent page structs.
      
      A much more detailed description is in the first patch.
      
      The first patch addresses the race window.  However, it adds a
      hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
      free operation.  This could lead to slowdowns if one is freeing a large
      number of hugetlb pages.
      
      The second patch optimizes the update_and_free_pages_bulk routine to only
      take the lock once in bulk operations.
      
      The second patch is technically not a bug fix, but includes a Fixes tag
      and Cc stable to avoid a performance regression.  It can be combined with
      the first, but was done separately to make reviewing easier.
      
      
      This patch (of 2):
      
      Freeing a hugetlb page and releasing base pages back to the underlying
      allocator such as buddy or cma is performed in two steps:
      - remove_hugetlb_folio() is called to remove the folio from hugetlb
        lists, get a ref on the page and remove hugetlb destructor.  This
        all must be done under the hugetlb lock.  After this call, the page
        can be treated as a normal compound page or a collection of base
        size pages.
      - update_and_free_hugetlb_folio() is called to allocate vmemmap if
        needed and the free routine of the underlying allocator is called
        on the resulting page.  We can not hold the hugetlb lock here.
      
      One issue with this scheme is that a memory error could occur between
      these two steps.  In this case, the memory error handling code treats
      the old hugetlb page as a normal compound page or collection of base
      pages.  It will then try to SetPageHWPoison(page) on the page with an
      error.  If the page with error is a tail page without vmemmap, a write
      error will occur when trying to set the flag.
      
      Address this issue by modifying remove_hugetlb_folio() and
      update_and_free_hugetlb_folio() such that the hugetlb destructor is not
      cleared until after allocating vmemmap.  Since clearing the destructor
      requires holding the hugetlb lock, the clearing is done in
      remove_hugetlb_folio() if the vmemmap is present.  This saves a
      lock/unlock cycle.  Otherwise, destructor is cleared in
      update_and_free_hugetlb_folio() after allocating vmemmap.
      
      Note that this will leave hugetlb pages in a state where they are marked
      free (by hugetlb specific page flag) and have a ref count.  This is not
      a normal state.  The only code that would notice is the memory error
      code, and it is set up to retry in such a case.
      
      A subsequent patch will create a routine to do bulk processing of
      vmemmap allocation.  This will eliminate a lock/unlock cycle for each
      hugetlb page in the case where we are freeing a large number of pages.
      
      Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
      Fixes: ad2fa371 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32c87719
    • Miaohe Lin's avatar
      mm: memory-failure: avoid false hwpoison page mapped error info · faeb2ff2
      Miaohe Lin authored
      folio->_mapcount is overloaded in SLAB, so folio_mapped() has to be done
      after folio_test_slab() is checked. Otherwise slab folio might be treated
      as a mapped folio leading to false 'Someone maps the hwpoison page' error
      info.
      
      Link: https://lkml.kernel.org/r/20230727115643.639741-4-linmiaohe@huawei.com
      Fixes: 230ac719 ("mm/hwpoison: don't try to unpoison containment-failed pages")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      faeb2ff2
    • Miaohe Lin's avatar
      mm: memory-failure: fix potential unexpected return value from unpoison_memory() · f29623e4
      Miaohe Lin authored
      If unpoison_memory() fails to clear the page's hwpoisoned flag, the return
      value ret is expected to be -EBUSY.  But when get_hwpoison_page() returns
      1 and clearing the hwpoisoned flag then fails due to races, the return
      value is an unexpected 1, which confuses users.  There is also a code
      smell: the variable "ret" holds not only the return value of
      unpoison_memory() but also the return value of get_hwpoison_page().
      Clean this up by using a separate automatic variable solely for the
      return value of get_hwpoison_page(), as suggested by Naoya.
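The cleanup can be sketched as follows. This is a heavily simplified userspace model: `unpoison_model()` and its parameters are hypothetical, with `grab_result` standing in for the return value of get_hwpoison_page() and `cleared` for whether the hwpoisoned flag was actually cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Keep the helper's result in its own variable (ghp) so that a positive
 * "reference taken" value can never leak out as the function's result.
 */
int unpoison_model(int grab_result, bool cleared)
{
	int ret = -EBUSY;      /* what the caller sees unless clearing works */
	int ghp = grab_result; /* helper result, kept separate from ret */

	if (ghp < 0)
		return ghp;    /* propagate a lookup error unchanged */
	if (cleared)
		ret = 0;
	return ret;
}
```

With a shared variable, the case "helper returned 1, clearing failed" would return 1; with the split, it returns -EBUSY as callers expect.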
      
      Link: https://lkml.kernel.org/r/20230727115643.639741-3-linmiaohe@huawei.com
      Fixes: bf181c58 ("mm/hwpoison: fix unpoison_memory()")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f29623e4
    • Miaohe Lin's avatar
      mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page · f985fc32
      Miaohe Lin authored
      Patch series "A few fixup patches for mm", v2.
      
      This series contains a few fixup patches to fix potential unexpected
      return value, fix wrong swap entry type for hwpoisoned swapcache page and
      so on.  More details can be found in the respective changelogs.
      
      
      This patch (of 3):
      
      A hwpoisoned dirty swap cache page is kept in the swap cache, and there is
      simple interception code in do_swap_page() to catch it.  But on swapoff,
      unuse_pte() is unaware of this type of page and wrongly installs a generic
      "future accesses are invalid" swap entry for the hwpoisoned swap cache
      page.  The user then receives a SIGBUS signal without the expected
      BUS_MCEERR_AR payload.  BTW, the typo 'hwposioned' is fixed.
      
      Link: https://lkml.kernel.org/r/20230727115643.639741-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20230727115643.639741-2-linmiaohe@huawei.com
      Fixes: 6b970599 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f985fc32
    • Colin Ian King's avatar
      radix tree test suite: fix incorrect allocation size for pthreads · cac7ea57
      Colin Ian King authored
      Currently the pthread allocation for each array item is based on the size
      of a pthread_t pointer and should be the size of the pthread_t structure,
      so the allocation is under-allocating the correct size.  Fix this by using
      the size of each element in the pthreads array.
      
      Static analysis cppcheck reported:
      tools/testing/radix-tree/regression1.c:180:2: warning: Size of pointer
      'threads' used instead of size of its data. [pointerSize]
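The sizing idiom the fix uses looks roughly like this. `alloc_threads()` is a hypothetical helper written for illustration, not the code in regression1.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Allocate an array of n thread handles, sized by the element type. */
pthread_t *alloc_threads(size_t n)
{
	pthread_t *threads;

	/*
	 * sizeof(*threads) is sizeof(pthread_t). Writing sizeof(threads)
	 * or sizeof(pthread_t *) would give the size of a pointer instead.
	 */
	threads = malloc(n * sizeof(*threads));
	return threads;
}
```

Sizing by `sizeof(*threads)` stays correct even if the element type changes later. On platforms where pthread_t happens to be pointer-sized the original bug is latent rather than visible, which may be why only static analysis caught it.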
      
      Link: https://lkml.kernel.org/r/20230727160930.632674-1-colin.i.king@gmail.com
      Fixes: 1366c37e ("radix tree test harness")
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cac7ea57
    • David Howells's avatar
      crypto, cifs: fix error handling in extract_iter_to_sg() · f443fd5a
      David Howells authored
      Fix error handling in extract_iter_to_sg().  Pages need to be unpinned, not
      put in extract_user_to_sg() when handling IOVEC/UBUF sources.
      
      The bug may result in a warning like the following:
      
        WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 __lse_atomic_add arch/arm64/include/asm/atomic_lse.h:27 [inline]
        WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 arch_atomic_add arch/arm64/include/asm/atomic.h:28 [inline]
        WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 raw_atomic_add include/linux/atomic/atomic-arch-fallback.h:537 [inline]
        WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 atomic_add include/linux/atomic/atomic-instrumented.h:105 [inline]
        WARNING: CPU: 1 PID: 20384 at mm/gup.c:229 try_grab_page+0x108/0x160 mm/gup.c:252
        ...
        pc : try_grab_page+0x108/0x160 mm/gup.c:229
        lr : follow_page_pte+0x174/0x3e4 mm/gup.c:651
        ...
        Call trace:
         __lse_atomic_add arch/arm64/include/asm/atomic_lse.h:27 [inline]
         arch_atomic_add arch/arm64/include/asm/atomic.h:28 [inline]
         raw_atomic_add include/linux/atomic/atomic-arch-fallback.h:537 [inline]
         atomic_add include/linux/atomic/atomic-instrumented.h:105 [inline]
         try_grab_page+0x108/0x160 mm/gup.c:252
         follow_pmd_mask mm/gup.c:734 [inline]
         follow_pud_mask mm/gup.c:765 [inline]
         follow_p4d_mask mm/gup.c:782 [inline]
         follow_page_mask+0x12c/0x2e4 mm/gup.c:839
         __get_user_pages+0x174/0x30c mm/gup.c:1217
         __get_user_pages_locked mm/gup.c:1448 [inline]
         __gup_longterm_locked+0x94/0x8f4 mm/gup.c:2142
         internal_get_user_pages_fast+0x970/0xb60 mm/gup.c:3140
         pin_user_pages_fast+0x4c/0x60 mm/gup.c:3246
         iov_iter_extract_user_pages lib/iov_iter.c:1768 [inline]
         iov_iter_extract_pages+0xc8/0x54c lib/iov_iter.c:1831
         extract_user_to_sg lib/scatterlist.c:1123 [inline]
         extract_iter_to_sg lib/scatterlist.c:1349 [inline]
         extract_iter_to_sg+0x26c/0x6fc lib/scatterlist.c:1339
         hash_sendmsg+0xc0/0x43c crypto/algif_hash.c:117
         sock_sendmsg_nosec net/socket.c:725 [inline]
         sock_sendmsg+0x54/0x60 net/socket.c:748
         ____sys_sendmsg+0x270/0x2ac net/socket.c:2494
         ___sys_sendmsg+0x80/0xdc net/socket.c:2548
         __sys_sendmsg+0x68/0xc4 net/socket.c:2577
         __do_sys_sendmsg net/socket.c:2586 [inline]
         __se_sys_sendmsg net/socket.c:2584 [inline]
         __arm64_sys_sendmsg+0x24/0x30 net/socket.c:2584
         __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
         invoke_syscall+0x48/0x114 arch/arm64/kernel/syscall.c:52
         el0_svc_common.constprop.0+0x44/0xe4 arch/arm64/kernel/syscall.c:142
         do_el0_svc+0x38/0xa4 arch/arm64/kernel/syscall.c:191
         el0_svc+0x2c/0xb0 arch/arm64/kernel/entry-common.c:647
         el0t_64_sync_handler+0xc0/0xc4 arch/arm64/kernel/entry-common.c:665
         el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:591
      
      Link: https://lkml.kernel.org/r/20571.1690369076@warthog.procyon.org.uk
      Fixes: 01858469 ("netfs: Add a function to extract an iterator into a scatterlist")
      Reported-by: <syzbot+9b82859567f2e50c123e@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/linux-mm/000000000000273d0105ff97bf56@google.com/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Steve French <stfrench@microsoft.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Shyam Prasad N <nspmangalore@gmail.com>
      Cc: Rohith Surabattula <rohiths.msft@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f443fd5a
    • Andrew Yang's avatar
      zsmalloc: fix races between modifications of fullness and isolated · 4b5d1e47
      Andrew Yang authored
      We encountered many kernel exceptions of VM_BUG_ON(zspage->isolated ==
      0) in dec_zspage_isolation() and BUG_ON(!pages[1]) in zs_unmap_object()
      lately.  This issue only occurs when migration and reclamation occur at
      the same time.
      
      With our memory stress test, we can reproduce this issue several times
      a day.  We have no idea why no one else encountered this issue.  BTW,
      we switched to the new kernel version with this defect a few months
      ago.
      
      Since fullness and isolated share the same unsigned int, modifications of
      them should be protected by the same lock.
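Why two fields in one word need a common lock can be shown by replaying the interleaving by hand. This is a single-threaded sketch with a hypothetical bit layout (fullness in bits 0-3, isolated in bits 4-6); it does not reflect the actual struct zspage layout.

```c
#include <assert.h>

/* Hypothetical layout of the shared word, for illustration only. */
#define FULLNESS_MASK  0x0fu
#define ISOLATED_SHIFT 4
#define ISOLATED_MASK  (0x7u << ISOLATED_SHIFT)

/* Replay two unsynchronized read-modify-write updates of one word. */
unsigned int racy_update(unsigned int word, unsigned int fullness,
			 unsigned int isolated)
{
	unsigned int a = word; /* writer A loads the word */
	unsigned int b = word; /* writer B loads the same, now stale, word */

	a = (a & ~FULLNESS_MASK) | (fullness & FULLNESS_MASK);
	b = (b & ~ISOLATED_MASK) |
	    ((isolated << ISOLATED_SHIFT) & ISOLATED_MASK);

	word = a; /* A stores its result */
	word = b; /* B stores last and wipes out A's fullness update */
	return word;
}

/* The same two updates applied one after another, as under one lock. */
unsigned int serialized_update(unsigned int word, unsigned int fullness,
			       unsigned int isolated)
{
	word = (word & ~FULLNESS_MASK) | (fullness & FULLNESS_MASK);
	word = (word & ~ISOLATED_MASK) |
	       ((isolated << ISOLATED_SHIFT) & ISOLATED_MASK);
	return word;
}
```

Starting from 0 with fullness 2 and isolated 1, the racy replay loses the fullness update, while the serialized version keeps both fields.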
      
      [andrew.yang@mediatek.com: move comment]
        Link: https://lkml.kernel.org/r/20230727062910.6337-1-andrew.yang@mediatek.com
      Link: https://lkml.kernel.org/r/20230721063705.11455-1-andrew.yang@mediatek.com
      Fixes: c4549b87 ("zsmalloc: remove zspage isolation for migration")
      Signed-off-by: Andrew Yang <andrew.yang@mediatek.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b5d1e47
  4. Jul 31, 2023
    • Linus Torvalds's avatar
      Linux 6.5-rc4 · 5d0c230f
      Linus Torvalds authored
      5d0c230f
    • Linus Torvalds's avatar
      Merge tag 'spi-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · d5bb4b89
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A bunch of fixes for the Qualcomm QSPI driver, fixing multiple issues
        with the newly added DMA mode - it had a number of issues exposed when
        tested in a wider range of use cases, both race condition style issues
        and issues with different inputs to those that had been used in test"
      
      * tag 'spi-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: spi-qcom-qspi: Add mem_ops to avoid PIO for badly sized reads
        spi: spi-qcom-qspi: Fallback to PIO for xfers that aren't multiples of 4 bytes
        spi: spi-qcom-qspi: Add DMA_CHAIN_DONE to ALL_IRQS
        spi: spi-qcom-qspi: Call dma_wmb() after setting up descriptors
        spi: spi-qcom-qspi: Use GFP_ATOMIC flag while allocating for descriptor
        spi: spi-qcom-qspi: Ignore disabled interrupts' status in isr
      d5bb4b89
    • Linus Torvalds's avatar
      Merge tag 'regulator-fix-v6.5-rc3' of... · 3dfe6886
      Linus Torvalds authored
      Merge tag 'regulator-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
      
      Pull regulator fixes from Mark Brown:
       "A couple of small fixes for the mt6358 driver, fixing error
        reporting and a bootstrapping issue"
      
      * tag 'regulator-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: mt6358: Fix incorrect VCN33 sync error message
        regulator: mt6358: Sync VCN33_* enable status after checking ID
      3dfe6886
    • Linus Torvalds's avatar
      Merge tag 'usb-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 88f66f13
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a set of USB driver fixes for 6.5-rc4. Included in here are:
      
         - new USB serial device ids
      
         - dwc3 driver fixes for reported issues
      
         - typec driver fixes for reported problems
      
         - gadget driver fixes
      
         - reverts of some problematic USB changes that went into -rc1
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'usb-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (24 commits)
        usb: misc: ehset: fix wrong if condition
        usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy
        usb: cdns3: fix incorrect calculation of ep_buf_size when more than one config
        usb: gadget: call usb_gadget_check_config() to verify UDC capability
        usb: typec: Use sysfs_emit_at when concatenating the string
        usb: typec: Iterate pds array when showing the pd list
        usb: typec: Set port->pd before adding device for typec_port
        usb: typec: qcom: fix return value check in qcom_pmic_typec_probe()
        Revert "usb: gadget: tegra-xudc: Fix error check in tegra_xudc_powerdomain_init()"
        Revert "usb: xhci: tegra: Fix error check"
        USB: gadget: Fix the memory leak in raw_gadget driver
        usb: gadget: core: remove unbalanced mutex_unlock in usb_gadget_activate
        Revert "usb: dwc3: core: Enable AutoRetry feature in the controller"
        Revert "xhci: add quirk for host controllers that don't update endpoint DCS"
        USB: quirks: add quirk for Focusrite Scarlett
        usb: xhci-mtk: set the dma max_seg_size
        MAINTAINERS: drop invalid usb/cdns3 Reviewer e-mail
        usb: dwc3: don't reset device side if dwc3 was configured as host-only
        usb: typec: ucsi: move typec_set_mode(TYPEC_STATE_SAFE) to ucsi_unregister_partner()
        usb: ohci-at91: Fix the unhandle interrupt when resume
        ...
    • Merge tag 'tty-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · e6d34ced
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are some small TTY and serial driver fixes for 6.5-rc4 for some
        reported problems. Included in here is:
      
         - TIOCSTI fix for braille readers
      
         - documentation fix for minor numbers
      
         - MAINTAINERS update for new serial files in -rc1
      
         - minor serial driver fixes for reported problems
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'tty-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: 8250_dw: Preserve original value of DLF register
        tty: serial: sh-sci: Fix sleeping in atomic context
        serial: sifive: Fix sifive_serial_console_setup() section
        Documentation: devices.txt: reconcile serial/ucc_uart minor numers
        MAINTAINERS: Update TTY layer for lists and recently added files
        tty: n_gsm: fix UAF in gsm_cleanup_mux
        TIOCSTI: always enable for CAP_SYS_ADMIN
    • Merge tag 'staging-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 3d6b77a8
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are three small staging driver fixes for 6.5-rc4 that resolve
        some reported problems. These fixes are:
      
         - fix for an old bug in the r8712 driver
      
         - fbtft driver fix for a spi device
      
         - potential overflow fix in the ks7010 driver
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext()
        staging: fbtft: ili9341: use macro FBTFT_REGISTER_SPI_DRIVER
        staging: r8712: Fix memory leak in _r8712_init_xmit_priv()
    • Merge tag 'char-misc-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · cf270e7b
      Linus Torvalds authored
      Pull char driver and Documentation fixes from Greg KH:
       "Here is a char driver fix and some documentation updates for 6.5-rc4
        that contain the following changes:
      
         - sram/genalloc bugfix for reported problem
      
         - security-bugs.rst update based on recent discussions
      
         - embargoed-hardware-issues minor cleanups and then partial revert
           for the project/company lists
      
        All of these have been in linux-next for a while with no reported
        problems, and the documentation updates have all been reviewed by the
        relevant developers"
      
      * tag 'char-misc-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        misc/genalloc: Name subpools by of_node_full_name()
        Documentation: embargoed-hardware-issues.rst: add AMD to the list
        Documentation: embargoed-hardware-issues.rst: clean out empty and unused entries
        Documentation: security-bugs.rst: clarify CVE handling
        Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group
    • Merge tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · b0b9850e
      Linus Torvalds authored
      
      Pull probe fixes from Masami Hiramatsu:
      
       - probe-events: add NULL check for some BTF API calls which can return
         either an error code or NULL.
      
       - ftrace selftests: check fprobe and kprobe event correctly. This fixes
         a miss condition of the test command.
      
       - kprobes: do not allow probing functions that start with "__cfi_" or
         "__pfx_" since those are auto generated for kernel CFI and not
         executed.
      
      * tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        kprobes: Prohibit probing on CFI preamble symbol
        selftests/ftrace: Fix to check fprobe event eneblement
        tracing/probes: Fix to add NULL check for BTF APIs
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 98a05fe8
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86:
      
         - Do not register IRQ bypass consumer if posted interrupts are not
           supported
      
         - Fix missed device interrupt due to non-atomic update of IRR
      
         - Use GFP_KERNEL_ACCOUNT for pid_table in ipiv
      
         - Make VMREAD error path play nice with noinstr
      
         - x86: Acquire SRCU read lock when handling fastpath MSR writes
      
         - Support linking rseq tests statically against glibc 2.35+
      
         - Fix reference count for stats file descriptors
      
         - Detect userspace setting invalid CR0
      
        Non-KVM:
      
         - Remove coccinelle script that has repeatedly caused confusion
           ("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE()
           usage", acked by Greg)"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
        KVM: selftests: Expand x86's sregs test to cover illegal CR0 values
        KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest
        KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid
        Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage"
        KVM: selftests: Verify stats fd is usable after VM fd has been closed
        KVM: selftests: Verify stats fd can be dup()'d and read
        KVM: selftests: Verify userspace can create "redundant" binary stats files
        KVM: selftests: Explicitly free vcpus array in binary stats test
        KVM: selftests: Clean up stats fd in common stats_test() helper
        KVM: selftests: Use pread() to read binary stats header
        KVM: Grab a reference to KVM for VM and vCPU stats file descriptors
        selftests/rseq: Play nice with binaries statically linked against glibc 2.35+
        Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"
        KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes
        KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path
        KVM: VMX: Make VMREAD error path play nice with noinstr
        KVM: x86/irq: Conditionally register IRQ bypass consumer again
        KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv
        KVM: x86: check the kvm_cpu_get_interrupt result before using it
        KVM: x86: VMX: set irr_pending in kvm_apic_update_irr
        ...
    • Merge tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c959e900
      Linus Torvalds authored
      Pull locking fix from Borislav Petkov:
      
       - Fix an rtmutex race condition resulting from sharing of the sort key
         between the lock waiters and the PI chain tree (->pi_waiters) of a
         task by giving each tree its own sort key
      
      * tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/rtmutex: Fix task->pi_waiters integrity
    • Merge tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d410b62e
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - AMD's automatic IBRS doesn't enable cross-thread branch target
         injection protection (STIBP) for user processes. Enable STIBP on such
         systems.
      
       - Do not delete AMD MCE error thresholding sysfs kobjects when
         destroying them; put the ref instead, so that the kernfs pointer
         is not deleted prematurely
      
       - Restore annotation in ret_from_fork_asm() so that kthread stack
         unwinding is no longer marked as unreliable, which was breaking
         livepatching
      
      * tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled
        x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks
        x86: Fix kthread unwind
    • Merge tag 'irq_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · eb9fe179
      Linus Torvalds authored
      Pull irq fixes from Borislav Petkov:
      
       - Work around an erratum on GIC700, where a race between a CPU handling
         a wake-up interrupt, a change of affinity, and another CPU going to
         sleep can result in a lack of wake-up event on the next interrupt
      
       - Fix the locking required on a VPE for GICv4
      
       - Enable Rockchip 3588001 erratum workaround for RK3588S
      
       - Fix the irq-bcm6345-l1 assumption that the boot CPU is always the
         first CPU in the system
      
      * tag 'irq_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3: Workaround for GIC-700 erratum 2941627
        irqchip/gic-v3: Enable Rockchip 3588001 erratum workaround for RK3588S
        irqchip/gic-v4.1: Properly lock VPEs when doing a directLPI invalidation
        irq-bcm6345-l1: Do not assume a fixed block to cpu mapping
  5. Jul 30, 2023
    • Merge tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · d31e3792
      Linus Torvalds authored
      Pull smb client fixes from Steve French:
       "Four small SMB3 client fixes:
      
         - two reconnect fixes (to address the case where non-default
           iocharset gets incorrectly overridden at reconnect with the
           default charset)
      
         - fix for NTLMSSP_AUTH request setting a flag incorrectly
      
         - Add missing check for invalid tlink (tree connection) in ioctl"
      
      * tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: add missing return value check for cifs_sb_tlink
        smb3: do not set NTLMSSP_VERSION flag for negotiate not auth request
        cifs: fix charset issue in reconnection
        fs/nls: make load_nls() take a const parameter