  1. Oct 05, 2023
  2. Oct 02, 2023
  3. Oct 01, 2023
  4. Sep 30, 2023
    • Crash: add lock to serialize crash hotplug handling · e2a8f20d
      Baoquan He authored
      Eric reported that handling of the corresponding crash hotplug event can
      easily fail when many memory hotplug events are notified in a short
      period.  The handling fails because it cannot take __kexec_lock.
      
      =======
      [   78.714569] Fallback order for Node 0: 0
      [   78.714575] Built 1 zonelists, mobility grouping on.  Total pages: 1817886
      [   78.717133] Policy zone: Normal
      [   78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   80.056643] PEFILE: Unsigned PE binary
      =======
      
      Memory hotplug events are notified in quick succession and in large
      numbers, while the handling of crash hotplug is comparatively slow.  So
      the atomic variable __kexec_lock and kexec_trylock() can't guarantee the
      serialization of crash hotplug handling.
      
      Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
      handling specifically.  This doesn't impact the usage of __kexec_lock.
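
The difference between the two locking policies can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the names `hp_result`, `simulate_trylock` and `simulate_blocking` are hypothetical, one event is assumed to arrive per tick, and each handler is assumed to hold the lock for `cost` ticks.

```c
#include <assert.h>

/* Outcome of a burst of hotplug events under one locking policy. */
struct hp_result {
    int handled;  /* events whose handler actually ran */
    int skipped;  /* events dropped because the lock was busy */
};

/*
 * Trylock policy (model): one event arrives per tick (ticks 0..n_events-1)
 * and each handler holds the lock for `cost` ticks.  An event that arrives
 * while the lock is held is skipped, like a failed kexec_trylock().
 */
struct hp_result simulate_trylock(int n_events, int cost)
{
    struct hp_result r = { 0, 0 };
    int busy_until = 0;  /* first tick at which the lock is free again */

    assert(cost > 0);
    for (int t = 0; t < n_events; t++) {
        if (t < busy_until) {
            r.skipped++;
            continue;
        }
        busy_until = t + cost;
        r.handled++;
    }
    return r;
}

/*
 * Blocking policy (model): an event that finds the lock busy waits for it,
 * the way a sleeping mutex would, so every event is eventually handled.
 */
struct hp_result simulate_blocking(int n_events, int cost)
{
    struct hp_result r = { 0, 0 };
    int busy_until = 0;

    assert(cost > 0);
    for (int t = 0; t < n_events; t++) {
        int start = t < busy_until ? busy_until : t;  /* wait if needed */

        busy_until = start + cost;
        r.handled++;
    }
    return r;
}
```

With 10 events and a handler cost of 4 ticks, the trylock model handles 3 events and skips 7, while the blocking model handles all 10.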
      
      Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
      Fixes: 24726275 ("crash: add generic infrastructure for crash hotplug support")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Tested-by: Eric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and... · bbe246f8
      Juntong Deng authored
      selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
      
      According to the awk manual, the -e option does not need to be specified
      in front of 'program' (unless it is mixed with a program-file).

      The redundant -e option can cause an error when users run an awk
      implementation other than gawk (for example, mawk does not support the
      -e option).

      Example error:
      awk: not an option: -e
      
      Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified · 24526268
      Yang Shi authored
      When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, the
      kernel should attempt to migrate all existing pages and return -EIO if
      there is a misplaced or unmovable page.  Then commit 6f4576e3
      ("mempolicy: apply page table walker on queue_pages_range()") messed up
      the return value and no longer broke the VMA scan early when
      MPOL_MF_STRICT alone was specified.  The return value problem was fixed
      by commit a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified"), but that fix breaks the VMA walk early
      when an unmovable page is met, which may leave some pages unmigrated.
      
      The code should conceptually do:
      
       if (MPOL_MF_MOVE|MOVEALL)    /* without MPOL_MF_STRICT */
           scan all vmas
           try to migrate the existing pages
           return success
       else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
           scan all vmas
           try to migrate the existing pages
           return -EIO if unmovable or migration failed
       else if (MPOL_MF_STRICT)    /* MPOL_MF_STRICT alone */
           break early if meets unmovable and don't call mbind_range() at all
       else /* none of those flags */
           check the ranges in test_walk, EFAULT without mbind_range() if discontig.
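
The first three branches of the conceptual flow above can be sketched as a small userspace model. This is an illustration under assumptions, not the kernel's queue_pages_range(): the `SK_*` flag values, `struct walk_outcome` and `sketch_queue_pages()` are hypothetical stand-ins, every page is assumed to be misplaced, and migration failures and the flag-less range check are not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPOL_MF_STRICT, MPOL_MF_MOVE, MPOL_MF_MOVE_ALL. */
#define SK_STRICT   (1u << 0)
#define SK_MOVE     (1u << 1)
#define SK_MOVE_ALL (1u << 2)

struct walk_outcome {
    int ret;         /* 0 or -EIO */
    size_t queued;   /* pages queued for migration */
    size_t visited;  /* pages the walk looked at */
};

/*
 * unmovable[i] is true when page i cannot be migrated.  All pages are
 * assumed to be misplaced, so each one is a migration candidate.
 */
struct walk_outcome sketch_queue_pages(unsigned int flags,
                                       const bool *unmovable, size_t n)
{
    struct walk_outcome o = { 0, 0, 0 };
    bool move = (flags & (SK_MOVE | SK_MOVE_ALL)) != 0;
    bool strict = (flags & SK_STRICT) != 0;
    bool saw_unmovable = false;

    assert(unmovable != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        o.visited++;
        if (unmovable[i]) {
            saw_unmovable = true;
            if (strict && !move)
                break;      /* STRICT alone: stop at the first hit */
            continue;       /* with MOVE*: keep walking the range */
        }
        if (move)
            o.queued++;
    }
    if (strict && saw_unmovable)
        o.ret = -EIO;       /* reported only when STRICT was requested */
    return o;
}
```

In this model, a three-page range whose middle page is unmovable is walked to the end with MOVE | STRICT (two pages queued, -EIO returned), while STRICT alone stops after the second page.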
      
      Fixed the behavior.
      
      Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
      Fixes: a7f40cfe ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions() · 45120b15
      Jinjie Ruan authored
      When CONFIG_DAMON_VADDR_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y are set, the memory leak below is
      detected.

      Since commit 9f86d624 ("mm/damon/vaddr-test: remove unnecessary
      variables"), the damon_destroy_ctx() call has been removed, but the test
      still calls damon_new_target() and damon_new_region().  The damon_region
      allocated by kmem_cache_alloc() in damon_new_region() and the
      damon_target allocated by kmalloc() in damon_new_target() are therefore
      not freed.  The damon_region allocated by damon_new_region() in
      damon_set_regions() is also not freed.

      So use damon_destroy_target() to free all the damon_regions and the
      damon_target.
      
          unreferenced object 0xffff888107c9a940 (size 64):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              60 c7 9c 07 81 88 ff ff f8 cb 9c 07 81 88 ff ff  `...............
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079cc740 (size 56):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9ac40 (size 64):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              a0 cc 9c 07 81 88 ff ff 78 a1 76 07 81 88 ff ff  ........x.v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079ccc80 (size 56):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9af40 (size 64):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              20 a2 76 07 81 88 ff ff b8 a6 76 07 81 88 ff ff   .v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a200 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a740 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.025s)
            hex dump (first 32 bytes):
              3d 00 00 00 00 00 00 00 3f 00 00 00 00 00 00 00  =.......?.......
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888108038240 (size 64):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 03 00 00 00 6b 6b 6b 6b  ............kkkk
              48 ad 76 07 81 88 ff ff 98 ae 76 07 81 88 ff ff  H.v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776ad28 (size 56):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
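
The ownership rule behind the fix can be sketched in userspace C: the target owns its regions, so destroying the target must free every region and then the target itself. The `sk_*` names and the allocation counter are hypothetical and only model the idea; they are not the DAMON API.

```c
#include <assert.h>
#include <stdlib.h>

static int sk_live_allocs;  /* number of objects allocated and not yet freed */

struct sk_region {
    unsigned long start, end;
    struct sk_region *next;
};

struct sk_target {
    struct sk_region *regions;  /* singly linked list owned by the target */
};

struct sk_target *sk_new_target(void)
{
    struct sk_target *t = calloc(1, sizeof(*t));

    if (t)
        sk_live_allocs++;
    return t;
}

/* Allocate a region and hand ownership to the target. */
int sk_add_region(struct sk_target *t, unsigned long start, unsigned long end)
{
    struct sk_region *r;

    assert(t != NULL);
    r = malloc(sizeof(*r));
    if (!r)
        return -1;
    r->start = start;
    r->end = end;
    r->next = t->regions;
    t->regions = r;
    sk_live_allocs++;
    return 0;
}

/* Free every region the target owns, then the target itself. */
void sk_destroy_target(struct sk_target *t)
{
    struct sk_region *r, *next;

    assert(t != NULL);
    for (r = t->regions; r; r = next) {
        next = r->next;
        free(r);
        sk_live_allocs--;
    }
    free(t);
    sk_live_allocs--;
}

int sk_live(void)
{
    return sk_live_allocs;
}
```

A test that creates a target with three regions leaves four live objects behind unless it ends with sk_destroy_target(), which is the pattern the leak report above points at.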
      
      Link: https://lkml.kernel.org/r/20230925072100.3725620-1-ruanjinjie@huawei.com
      Fixes: 9f86d624 ("mm/damon/vaddr-test: remove unnecessary variables")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, memcg: reconsider kmem.limit_in_bytes deprecation · 4597648f
      Michal Hocko authored
      This reverts commit 86327e8e ("memcg: drop kmem.limit_in_bytes") and
      partially reverts commit 58056f77 ("memcg, kmem: further deprecate
      kmem.limit_in_bytes"), which have incrementally removed support for the
      kernel memory accounting hard limit.  Unfortunately it has turned out that
      there is still userspace depending on the existence of
      memory.kmem.limit_in_bytes [1].  The underlying functionality is not
      really required, but the non-existent file confuses that userspace,
      which fails as a result.  A patch to fix this on the userspace side
      has been submitted, but it is hard to predict how it will propagate through
      the maze of 3rd party consumers of the software.
      
      Now, reverting 86327e8e alone is not an option because there is
      another set of userspace which cannot cope with ENOTSUPP returned when
      writing to the file.  Therefore we have to revisit 58056f77 as
      well.  There are two ways to go ahead: either we give up on the
      deprecation and fully revert 58056f77 as well, or we keep
      kmem.limit_in_bytes but make the write a noop and warn about the fact.
      This should work for both known breaking workloads, which depend on the
      existence of the file but do not depend on the hard limit enforcement.
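
The "keep the file, ignore the write, warn once" behaviour can be modeled in a few lines of userspace C. This is a hedged sketch, not the memcg code: `sk_kmem_limit_write()`, the counter, the stand-in limit variable and the warning text are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

static unsigned long sk_hard_limit = ~0UL;  /* stand-in for "no limit" */
static int sk_warnings_printed;

/*
 * Model of a write handler for a deprecated control file: the write is
 * accepted (so existing userspace keeps working), the value is ignored,
 * and a warning is emitted only on the first write.
 */
long sk_kmem_limit_write(const char *buf, size_t nbytes)
{
    assert(buf != NULL);
    if (sk_warnings_printed == 0) {
        fprintf(stderr, "kmem limit file is deprecated; writes have no effect\n");
        sk_warnings_printed++;
    }
    /* Deliberately no update of sk_hard_limit: the write is a no-op. */
    return (long)nbytes;  /* report success, unlike an ENOTSUPP error */
}

unsigned long sk_current_limit(void)
{
    return sk_hard_limit;
}

int sk_warning_count(void)
{
    return sk_warnings_printed;
}
```

Returning the byte count instead of an error is what lets both groups of userspace described above continue: the file exists, and writing to it does not fail.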
      
      Note to backporters to stable trees.  a8c49af3 ("memcg: add per-memcg
      total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
      so the accounting is not done by obj_cgroup_charge_pages directly for v1
      anymore.  Prior kernels need to add it explicitly (thanks to Johannes for
      pointing this out).
      
      [akpm@linux-foundation.org: fix build - remove unused local]
      Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1]
      Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
      Fixes: 86327e8e ("memcg: drop kmem.limit_in_bytes")
      Fixes: 58056f77 ("memcg, kmem: further deprecate kmem.limit_in_bytes")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix potential memory corruption on duplicate store · ca56489c
      Domenico Cerasuolo authored
      While stress-testing zswap, memory corruption was observed when writing
      back pages.  __frontswap_store used to check for duplicate entries before
      attempting to store a page in zswap, because if the store fails
      the old entry isn't removed from the tree.  This change removes duplicate
      entries in zswap_store before the actual attempt.
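
The ordering that matters here, dropping any older copy before attempting the new store, can be shown with a toy key/value table. This is a minimal sketch with hypothetical `sk_*` names; the `backend_ok` flag stands in for whatever may make a real store fail, and nothing here is the zswap implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_SLOTS 8

struct sk_entry {
    bool used;
    long offset;  /* key: swap offset */
    int data;     /* stand-in for the stored page contents */
};

struct sk_tree {
    struct sk_entry slots[SK_SLOTS];
};

static struct sk_entry *sk_find(struct sk_tree *t, long offset)
{
    for (int i = 0; i < SK_SLOTS; i++)
        if (t->slots[i].used && t->slots[i].offset == offset)
            return &t->slots[i];
    return NULL;
}

bool sk_load(struct sk_tree *t, long offset, int *data)
{
    struct sk_entry *e = sk_find(t, offset);

    if (!e)
        return false;
    *data = e->data;
    return true;
}

/*
 * Invalidate any existing entry for this offset first, then try to store.
 * If the store fails, no stale copy is left behind to be loaded or
 * written back later.
 */
bool sk_store(struct sk_tree *t, long offset, int data, bool backend_ok)
{
    struct sk_entry *dup;

    assert(t != NULL);
    dup = sk_find(t, offset);
    if (dup)
        dup->used = false;
    if (!backend_ok)
        return false;
    for (int i = 0; i < SK_SLOTS; i++) {
        if (!t->slots[i].used) {
            t->slots[i].used = true;
            t->slots[i].offset = offset;
            t->slots[i].data = data;
            return true;
        }
    }
    return false;  /* table full */
}
```

If the duplicate check ran only after a successful store, a failed second store for the same offset would leave the first, now outdated, value reachable through sk_load().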
      
      [cerasuolodomenico@gmail.com: add a warning and a comment, per Johannes]
        Link: https://lkml.kernel.org/r/20230925130002.1929369-1-cerasuolodomenico@gmail.com
      Link: https://lkml.kernel.org/r/20230922172211.1704917-1-cerasuolodomenico@gmail.com
      Fixes: 42c06a0e ("mm: kill frontswap")
      Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries · 6f1bace9
      Ryan Roberts authored
      When called with a swap entry that does not embed a PFN (e.g. 
      PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of
      set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM
      is enabled) or cause a dereference of an invalid address and subsequent
      panic.
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      Arguably, the root cause is really due to commit 18f39629 ("mm:
      hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the
      interface to the core code by removing set_huge_swap_pte_at() (which took
      a page size parameter) and replacing it with calls to set_huge_pte_at()
      where the size was inferred from the folio, as described above.  While that
      commit didn't break anything at the time, it did break the interface
      because it couldn't handle swap entries without PFNs.  And since then new
      callers have come along which rely on this working.  But given the
      brokenness is only observable after commit 8a13897f ("mm: userfaultfd:
      support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag.
      
      Now that we have modified the set_huge_pte_at() interface to pass the huge
      page size in the previous patch, we can trivially fix this issue.
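
Once the size is passed in, the number of page-table entries to write follows from the size alone, with no need to look inside the pte. The sketch below assumes the 4K-granule geometry (16 contiguous entries per contiguous range, 2 MiB PMD, 1 GiB PUD); the `SK_*` constants and `sk_num_contig_entries()` are hypothetical names for illustration, and other granule sizes would use different values.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 4K-granule geometry; other granules use different values. */
#define SK_PAGE_SIZE     (1UL << 12)
#define SK_CONT_PTES     16
#define SK_CONT_PTE_SIZE (SK_CONT_PTES * SK_PAGE_SIZE)  /* 64 KiB */
#define SK_PMD_SIZE      (1UL << 21)                    /* 2 MiB */
#define SK_CONT_PMDS     16
#define SK_CONT_PMD_SIZE (SK_CONT_PMDS * SK_PMD_SIZE)   /* 32 MiB */
#define SK_PUD_SIZE      (1UL << 30)                    /* 1 GiB */

/*
 * Map a huge page size to the number of page-table entries that back it.
 * *pgsize receives the size covered by each entry.  Returns 0 (and sets
 * *pgsize to 0) for a size this model does not know.
 */
int sk_num_contig_entries(unsigned long sz, unsigned long *pgsize)
{
    assert(pgsize != NULL);
    switch (sz) {
    case SK_PUD_SIZE:
        *pgsize = SK_PUD_SIZE;
        return 1;
    case SK_CONT_PMD_SIZE:
        *pgsize = SK_PMD_SIZE;
        return SK_CONT_PMDS;
    case SK_PMD_SIZE:
        *pgsize = SK_PMD_SIZE;
        return 1;
    case SK_CONT_PTE_SIZE:
        *pgsize = SK_PAGE_SIZE;
        return SK_CONT_PTES;
    default:
        *pgsize = 0;
        return 0;
    }
}
```

Because the lookup depends only on `sz`, it gives the same answer for a present pte, a migration entry, or a marker entry that embeds no PFN.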
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb: add huge page size param to set_huge_pte_at() · 935d4f0c
      Ryan Roberts authored
      Patch series "Fix set_huge_pte_at() panic on arm64", v2.
      
      This series fixes a bug in arm64's implementation of set_huge_pte_at(),
      which can result in an unprivileged user causing a kernel panic.  The
      problem was triggered when running the new uffd poison mm selftest for
      HUGETLB memory.  This test (and the uffd poison feature) was merged for
      v6.5-rc7.
      
      Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
      (correctly this time) to get it backported to v6.5, where the issue first
      showed up.
      
      
      Description of Bug
      ==================
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
      it will dereference a bad pointer in page_folio():
      
          static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
          {
              VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
      
              return page_folio(pfn_to_page(swp_offset_pfn(entry)));
          }
      
      
      Fix
      ===
      
      The simplest fix would have been to revert the dodgy cleanup commit
      18f39629 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
      things have moved on, this would have required an audit of all the new
      set_huge_pte_at() call sites to see if they should be converted to
      set_huge_swap_pte_at().  As per the original intent of the change, it
      would also leave us open to future bugs when people invariably get it
      wrong and call the wrong helper.
      
      So instead, I've added a huge page size parameter to set_huge_pte_at(). 
      This means that the arm64 code has the size in all cases.  It's a bigger
      change, due to needing to touch the arches that implement the function,
      but it is entirely mechanical, so in my view, low risk.
      
      I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
      s390, sparc (and additionally x86_64).  I've additionally booted and run
      mm selftests against arm64, where I observe the uffd poison test is fixed,
      and there are no other regressions.
      
      
      This patch (of 2):
      
      In order to fix a bug, arm64 needs to be told the size of the huge page
      for which the pte is being set in set_huge_pte_at().  Provide for this by
      adding an `unsigned long sz` parameter to the function.  This follows the
      same pattern as huge_pte_clear().
      
      This commit makes the required interface modifications to the core mm as
      well as all arches that implement this function (arm64, parisc, powerpc,
      riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
      commit.
      
      No behavioral changes intended.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>	[powerpc 8xx]
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>	[vmalloc change]
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states · a8091f03
      Liam R. Howlett authored
      When updating the maple tree iterator to avoid rewalks, an issue was
      introduced when shifting beyond the limits.  This can be seen by trying to
      go to the previous address of 0, which would set the maple node to
      MAS_NONE and keep the range as the last entry.
      
      Subsequent calls to mas_find() would then search upwards from mas->last
      and skip the value at mas->index/mas->last.  This showed up as a bug in
      mprotect which skips the actual VMA at the current range after attempting
      to go to the previous VMA from 0.
      
      Since MAS_NONE may already be set when searching for a value that isn't
      contained within a node, changing the handling of MAS_NONE in mas_find()
      would make the code more complicated and error prone.  Furthermore, there
      was no way to tell which limit was hit, and thus which action to take
      (next or the entry at the current range).
      
      The solution is to add two states to track what happened with the
      previous iterator action.  This allows for the expected behaviour of the
      next command to return the correct item (either the item at the range
      requested, or the next/previous).
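
The idea of remembering which limit was hit can be illustrated with an iterator over a plain sorted array. This is a simplified model with hypothetical names (`sk_iter`, `sk_prev`, `sk_next`, `SK_UNDERFLOW`, `SK_OVERFLOW`), not the maple tree code: after stepping past either end, the next step in the opposite direction returns the entry at the current position instead of skipping it.

```c
#include <assert.h>
#include <stddef.h>

enum sk_state {
    SK_ACTIVE,     /* pos refers to the entry last returned */
    SK_UNDERFLOW,  /* last step tried to go below the first entry */
    SK_OVERFLOW    /* last step tried to go above the last entry */
};

struct sk_iter {
    const int *vals;  /* sorted entries */
    int n;            /* number of entries, must be > 0 */
    int pos;          /* current position */
    enum sk_state state;
};

const int *sk_prev(struct sk_iter *it)
{
    assert(it->n > 0);
    if (it->state == SK_OVERFLOW) {
        /* We ran off the top: the entry at pos was never left behind. */
        it->state = SK_ACTIVE;
        return &it->vals[it->pos];
    }
    if (it->pos == 0) {
        it->state = SK_UNDERFLOW;  /* remember which limit was hit */
        return NULL;
    }
    it->pos--;
    it->state = SK_ACTIVE;
    return &it->vals[it->pos];
}

const int *sk_next(struct sk_iter *it)
{
    assert(it->n > 0);
    if (it->state == SK_UNDERFLOW) {
        /* We ran off the bottom: return the current entry, don't skip it. */
        it->state = SK_ACTIVE;
        return &it->vals[it->pos];
    }
    if (it->pos == it->n - 1) {
        it->state = SK_OVERFLOW;
        return NULL;
    }
    it->pos++;
    it->state = SK_ACTIVE;
    return &it->vals[it->pos];
}
```

With a single "none" state the iterator could not tell these two cases apart, which is the ambiguity the two new states remove.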
      
      Tests are also added and updated accordingly.
      
      Link: https://lkml.kernel.org/r/20230921181236.509072-3-Liam.Howlett@oracle.com
      Link: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Link: https://lore.kernel.org/linux-mm/20230921181236.509072-1-Liam.Howlett@oracle.com/
      Fixes: 39193685 ("maple_tree: try harder to keep active node with mas_prev()")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
      Closes: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Closes: https://bugs.archlinux.org/task/79656
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>