  1. Sep 27, 2022
    • mm/thp: carry over dirty bit when thp splits on pmd · 0ccf7f16
      Peter Xu authored
      Carry over the dirty bit from pmd to pte when a huge pmd splits.  This
      shouldn't be a correctness issue, since when pmd_dirty() is set we'll
      have the page marked dirty anyway; however, carrying the dirty bit over
      helps the first writes to the split ptes on some archs like x86.
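
      A minimal sketch of the idea (hypothetical local names, standing in for
      the ones used in the pte-filling loop of the pmd split path):

        /* Sketch only: propagate the pmd's dirty bit into each split pte. */
        bool dirty = pmd_dirty(old_pmd);

        entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
        if (dirty)
                entry = pte_mkdirty(entry);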
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0ccf7f16
    • mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Peter Xu authored
      We've got a bunch of special swap entries that store a PFN inside the
      swap offset field.  To fetch the PFN, callers normally just use
      swp_offset(), assuming the offset is the PFN.
      
      Add a helper, swp_offset_pfn(), to fetch the PFN instead.  It masks the
      offset down to the maximum possible PFN width on the host, while a
      BUILD_BUG_ON() in is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS
      to make sure the swap offset can always store a full PFN.
      
      One reason to do so is that we never verified whether the swap offset
      can really fit a PFN.  At the same time, this patch also prepares for
      storing more information inside the swp offset field in the future, so
      the assumption that "swp_offset(entry)" is the PFN will soon no longer
      hold.
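
      A hedged sketch of what such a helper can look like; SWP_PFN_BITS and
      SWP_PFN_MASK are illustrative names for the maximum PFN width derived
      from MAX_PHYSMEM_BITS:

        /* Illustrative only: mask the swap offset down to the PFN width. */
        #define SWP_PFN_BITS    (MAX_PHYSMEM_BITS - PAGE_SHIFT)
        #define SWP_PFN_MASK    (BIT(SWP_PFN_BITS) - 1)

        static inline unsigned long swp_offset_pfn(swp_entry_t entry)
        {
                VM_BUG_ON(!is_pfn_swap_entry(entry));
                return swp_offset(entry) & SWP_PFN_MASK;
        }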
      
      Convert many of the swp_offset() callers to swp_offset_pfn() where
      appropriate.  Note that many of the existing users are not candidates
      for the conversion, e.g.:

        (1) when the swap entry is not a pfn swap entry at all, or
        (2) when we want to keep the whole swp_offset but only change the swp type.

      The latter can happen when fork() hits a write-migration swap entry
      pte: we may want to change the migration type from write to read but
      keep the rest, so it's not "fetching the PFN" but "changing the swap
      type only".  Those users are left alone so that, when there is more
      information within the swp offset, it will be carried over naturally in
      those cases.
      
      While at it, drop hwpoison_entry_to_pfn(), because that's exactly what
      the new swp_offset_pfn() provides.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0d206b5d
    • mm/swap: comment all the ifdef in swapops.h · eba4d770
      Peter Xu authored
      swapops.h contains quite a few layers of ifdefs, and some of the
      "#else" and "#endif" directives lack a comment naming the macro they
      refer to, which makes them hard to follow.  Add the comments.
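
      For illustration, the convention being applied looks like this:

        #ifdef CONFIG_MIGRATION
        /* ... migration entry helpers ... */
        #else  /* CONFIG_MIGRATION */
        /* ... stubs ... */
        #endif /* CONFIG_MIGRATION */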
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eba4d770
    • mm/x86: use SWP_TYPE_BITS in 3-level swap macros · 9c61d532
      Peter Xu authored
      Patch series "mm: Remember a/d bits for migration entries", v4.
      
      
      Problem
      =======
      
      When migrating a page, right now we always mark the migrated page as old &
      clean.
      
      However, that can lead to at least two problems:

        (1) We lose the real hot/cold information even though we could have
            persisted it.  That information shouldn't change just because the
            backing page changed after the migration.

        (2) There is always extra overhead on the first access to a migrated
            page, because the hardware MMU needs extra cycles to set the
            young bit again for reads, and the dirty bit for writes, as long
            as the hardware MMU supports these bits.
      
      Much recent upstream work has shown that (2) is not trivial and is
      actually very measurable.  In my test case, reading a 1G chunk of
      memory - jumping in page-size intervals - took 99ms just because of the
      extra work of setting the young bit on a generic x86_64 system,
      compared to 4ms when the young bit was already set.
      
      This issue was originally reported by Andrea Arcangeli.
      
      Solution
      ========
      
      To solve this problem, this patchset tries to remember the young/dirty
      bits in the migration entries and carry them over when recovering the
      ptes.
      
      We can do this because on many systems the swap offset is not fully
      used.  Migration entries use the swp offset to store the PFN only, and
      the PFN normally needs fewer bits than the swp offset provides.  That
      means we have some free bits in the swp offset that can be used to
      store things like the A/D bits, and that's how this series approaches
      the problem.
      
      max_swapfile_size() is used here to detect the per-arch offset width of
      swp entries.  The A/D bits are remembered automatically when the swp
      offset field is found to be wide enough to keep both the PFN and the
      extra bits.

      Since max_swapfile_size() can be slow, the last two patches cache its
      result, and cache swap_migration_ad_supported as a whole.
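
      As a rough sketch of how the extra bits can sit above the PFN in the
      swp offset (bit positions and helper names here are illustrative, not
      necessarily the exact ones used by the series):

        /* Illustrative layout: offset = [ unused | D | A | PFN ]. */
        #define SWP_MIG_YOUNG_BIT       (SWP_PFN_BITS)
        #define SWP_MIG_DIRTY_BIT       (SWP_PFN_BITS + 1)

        static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
        {
                if (migration_entry_supports_ad())
                        return swp_entry(swp_type(entry),
                                         swp_offset(entry) | BIT(SWP_MIG_YOUNG_BIT));
                return entry;   /* offset too narrow: silently drop the bit */
        }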
      
      Known Issues / TODOs
      ====================
      
      We still haven't taught madvise() (namely MADV_COLD/MADV_FREE) to
      recognize the new A/D bits in migration entries.  E.g., for MADV_COLD
      on a migration entry, it's not yet clear whether we should clear the A
      bit or just drop the entry directly.

      We didn't teach idle page tracking about the new migration entries,
      because that would need a larger rework of the rmap page table walk.
      However, things should already be better: before this patchset a page
      would always look old after migration, so the series fixes a potential
      false negative in idle page tracking for pages migrated before being
      observed.
      
      The other thing is that the migration A/D bits will not yet work for
      private device swap entries.  The code is there for completeness, but
      since private device swap entries do not have fields to store A/D bits
      yet, even if we persist A/D across a present pte switching to a
      migration entry, we'll lose them again when the migration entry is
      converted to a private device swap entry.
      
      Tests
      =====
      
      With the patchset applied, the immediate read access test [1] on the 1G
      chunk above drops from 99ms to 4ms after migration.  The test moves 1G
      of pages from node 0->1->0 and then reads the memory in page-size
      jumps, on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

      A similar effect can be measured when writing the memory for the first
      time after migration.

      With the patchset applied, both the initial read and the initial write
      after a page is migrated perform similarly to before the migration.
      
      Patch Layout
      ============
      
      Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
      
      Patch 3-4:  Prepare for the introduction of migration A/D bits
      
      Patch 5:    The core patch to remember young/dirty bit in swap offsets.
      
      Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
      
      [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
      
      
      This patch (of 7):
      
      Replace all the magic "5" with the macro.
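
      For illustration, the 3-level swap macros end up spelling the width
      with the named constant instead of a literal 5 (the exact definitions
      may differ slightly from this sketch):

        #define __swp_type(x)           (((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
        #define __swp_offset(x)         ((x).val >> SWP_TYPE_BITS)
        #define __swp_entry(type, offset) \
                ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})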
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c61d532
    • mm, hwpoison: cleanup some obsolete comments · 9cf28191
      Miaohe Lin authored
      1. Remove a meaningless comment in kill_proc(); it doesn't tell us
         anything.
      2. Fix the wrong function name get_hwpoison_unless_zero(); it should be
         get_page_unless_zero().
      3. The gatekeeper for free hwpoison pages has moved to check_new_page();
         update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9cf28191
    • mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings() · b680dae9
      Miaohe Lin authored
      PageTable can't be handled by memory_failure(). Filter it out explicitly in
      hwpoison_user_mappings(). This will also make code more consistent with the
      relevant check in unpoison_memory().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b680dae9
    • mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon() · 36537a67
      Miaohe Lin authored
      If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma(),
      as add_to_kill() won't be called in that case.  Move the mm check up to
      avoid possibly unneeded calls to page_mapped_in_vma().
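
      A minimal sketch of the reordering (variable names follow the
      description above; the surrounding loop is omitted):

        /* The mm comparison is cheap; the page-table walk inside
         * page_mapped_in_vma() is not, so do the cheap check first. */
        if (vma->vm_mm != t->mm)
                continue;
        if (!page_mapped_in_vma(page, vma))
                continue;
        add_to_kill(t, page, vma, to_kill);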
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      36537a67
    • mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages · 21c9e90a
      Miaohe Lin authored
      Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
      num_poisoned_pages_dec() can be killed as there's no caller now.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      21c9e90a
    • mm, hwpoison: use __PageMovable() to detect non-lru movable pages · da294991
      Miaohe Lin authored
      Using __PageMovable() is the more recommended way to detect non-lru
      movable pages.  It lets us avoid bumping the page refcount via
      isolate_movable_page() for pages that are actually LRU pages.  Also, if
      a page becomes PageLRU just after it is checked but before we try to
      isolate it, isolate_lru_page() will be called and will do the right
      work.
      
      [linmiaohe@huawei.com: fixes per Naoya Horiguchi]
        Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      da294991
    • mm, hwpoison: use ClearPageHWPoison() in memory_failure() · 2fe62e22
      Miaohe Lin authored
      Patch series "A few cleanup patches for memory-failure".
      
      This series contains a few cleanup patches: use __PageMovable() to
      detect non-lru movable pages, use num_poisoned_pages_sub() to reduce
      the overhead of multiple atomic ops, and so on.  More details can be
      found in the respective changelogs.
      
      
      This patch (of 6):
      
      Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
      hwpoison flags to avoid unneeded full memory barrier overhead.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2fe62e22
    • mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock · 4d24de94
      Yang Shi authored
      The syzbot reported the below problem:
      
      BUG: Bad page map in process syz-executor198  pte:8000000071c00227 pmd:74b30067
      addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
      file:(null) fault:0x0 mmap:0x0 read_folio:0x0
      CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
       vm_normal_page+0x10c/0x2a0 mm/memory.c:636
       hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
       madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
       do_madvise mm/madvise.c:1428 [inline]
       __do_sys_madvise mm/madvise.c:1428 [inline]
       __se_sys_madvise mm/madvise.c:1426 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f770ba87929
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
      R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
      R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
      
      Conceptually, the test program does the following:
      1. mmap 0x20000000 - 0x21000000 as an anonymous region
      2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED; io_uring_mmap()
         actually remaps the pages with special PTEs
      3. call MADV_COLLAPSE on 0x20000000 - 0x21000000
      
      It actually triggered the below race:
      
                   CPU A                                          CPU B
      mmap 0x20000000 - 0x21000000 as anon
                                                 madvise_collapse is called on this area
                                                   Retrieve start and end address from the vma (NEVER updated later!)
                                                   Collapsed the first 2M area and dropped mmap_lock
      Acquire mmap_lock
      mmap io_uring file at 0x20563000
      Release mmap_lock
                                                   Reacquire mmap_lock
                                                   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
                                                   scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
                                                   scan the 3rd 2M area (start from 0x20400000)
                                                     get into the vma created by io_uring
      
      hend should be updated after MADV_COLLAPSE reacquires mmap_lock, since
      the vma may have shrunk.  We don't have to worry about shrinking from
      the other direction, since that is caught by hugepage_vma_revalidate():
      either no valid vma is found or the vma doesn't fit anymore.
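
      A hedged sketch of the fix (names follow madvise_collapse(); the exact
      placement is an assumption):

        /* Refresh the end boundary after mmap_lock is reacquired and the vma
         * has been revalidated, since the vma may have shrunk meanwhile. */
        hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);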
      
      Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
      
      
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Reported-by: <syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d24de94
    • 6d751329
    • x86/uaccess: avoid check_object_size() in copy_from_user_nmi() · 59298997
      Kees Cook authored
      The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed
      to skip any checks where the length is known at compile time as a
      reasonable heuristic to avoid "likely known-good" cases.  However, it can
      only do this when the copy_*_user() helpers are, themselves, inline too.
      
      Using find_vmap_area() requires taking a spinlock.  The
      check_object_size() helper can call find_vmap_area() when the destination
      is in vmap memory.  If show_regs() is called in interrupt context, it will
      attempt a call to copy_from_user_nmi(), which may call check_object_size()
      and then find_vmap_area().  If something in normal context happens to be
      in the middle of calling find_vmap_area() (with the spinlock held), the
      interrupt handler will hang forever.
      
      copy_from_user_nmi() is actually called with a fixed-size length, so
      check_object_size() should never have been called in the first place.
      Given the narrow constraints, just replace the
      __copy_from_user_inatomic() call with an open-coded version that calls
      only into the sanitizers and not check_object_size(), followed by a
      call to raw_copy_from_user().
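
      A hedged sketch of the shape of the replacement; the exact sanitizer
      hooks differ between trees, as the bracketed note below mentions:

        /* Copy without check_object_size(): the length is a compile-time
         * constant at these call sites, so the hardened-usercopy check (and
         * its possible find_vmap_area() call) is not needed here. */
        ret = raw_copy_from_user(to, from, n);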
      
      [akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...]
      Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org
      Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com
      
      
      Fixes: 0aef499f ("mm/usercopy: Detect vmalloc overruns")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Florian Lehner <dev@der-flo.net>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Florian Lehner <dev@der-flo.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      59298997
    • mm/page_isolation: fix isolate_single_pageblock() isolation behavior · 80e2b584
      Zi Yan authored
      set_migratetype_isolate() does not allow isolating MIGRATE_CMA
      pageblocks unless it is used for a CMA allocation.
      isolate_single_pageblock() did not have the same behavior when used
      together with set_migratetype_isolate() in start_isolate_page_range().
      This allows an alloc_contig_range() call with a migratetype other than
      MIGRATE_CMA, like MIGRATE_MOVABLE (used by alloc_contig_pages()), to
      isolate the first and last pageblocks but fail on the rest.  The
      failure changes the migratetype of the first and last pageblocks from
      MIGRATE_CMA to MIGRATE_MOVABLE, corrupting the CMA region.  This can
      happen during gigantic page allocations.
      
      As Doug said here:
      https://lore.kernel.org/linux-mm/a3363a52-883b-dcd1-b77f-f2bb378d6f2d@gmail.com/T/#u,
      for gigantic page allocations the user would notice no difference,
      since the allocation in the CMA region fails just as it did before.
      But it might hurt the performance of device drivers that use CMA, since
      the CMA region size decreases.
      
      Fix it by passing the migratetype into isolate_single_pageblock(), so
      that the set_migratetype_isolate() used by isolate_single_pageblock()
      will prevent such isolation from happening.
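
      A hedged sketch of the interface change (the parameter position and the
      rest of the signature are illustrative):

        /* Thread the caller's migratetype down so that the inner
         * set_migratetype_isolate() can refuse to isolate MIGRATE_CMA
         * pageblocks for non-CMA allocations. */
        static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
                                            gfp_t gfp_flags, bool isolate_before,
                                            bool skip_isolation, int migratetype);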
      
      Link: https://lkml.kernel.org/r/20220914023913.1855924-1-zi.yan@sent.com
      
      
      Fixes: b2c9e2fb ("mm: make alloc_contig_range work at pageblock granularity")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reported-by: Doug Berger <opendmb@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Doug Berger <opendmb@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      80e2b584
    • mm,hwpoison: check mm when killing accessing process · 77677cdb
      Shuai Xue authored
      The GHES code calls memory_failure_queue() from IRQ context to queue
      work into a workqueue and schedule it on the current CPU.  The work is
      then processed by a kworker in memory_failure_work_func(), which calls
      memory_failure().
      
      When a page is already poisoned, commit a3f5d80e ("mm,hwpoison: send
      SIGBUS with error virutal address") makes memory_failure() call
      kill_accessing_process(), which:

          - takes the mmap lock of current->mm,
          - walks the page tables to find the faulting virtual address,
          - and sends SIGBUS to the current process with the error info.

      However, the kworker's mm is not valid, resulting in a null-pointer
      dereference.  So check the mm when killing the accessing process.
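
      A minimal sketch of the guard (the exact location inside
      kill_accessing_process() and the return value are assumptions):

        /* A kworker has no user mm to walk; bail out instead of
         * dereferencing a NULL mm. */
        if (!p->mm)
                return -EFAULT;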
      
      [akpm@linux-foundation.org: remove unrelated whitespace alteration]
      Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
      
      
      Fixes: a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address")
      Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      77677cdb
    • mm/hugetlb: correct demote page offset logic · 31731452
      Doug Berger authored
      With gigantic pages it may not be true that struct page structures are
      contiguous across the entire gigantic page.  The nth_page macro is used
      here in place of direct pointer arithmetic to correct for this.
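
      A hedged sketch of the change:

        /* With SPARSEMEM && !SPARSEMEM_VMEMMAP, "page + i" may walk off the
         * end of a section's struct page array; nth_page() handles the
         * discontinuity. */
        subpage = nth_page(page, i);    /* instead of: subpage = page + i; */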
      
      Mike said:
      
      : This error could cause addressing exceptions.  However, this is only
      : possible in configurations where CONFIG_SPARSEMEM &&
      : !CONFIG_SPARSEMEM_VMEMMAP.  Such a configuration option is rare and
      : unknown to be the default anywhere.
      
      Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
      
      
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      31731452
    • mm: prevent page_frag_alloc() from corrupting the memory · dac22531
      Maurizio Lombardi authored
      A number of drivers call page_frag_alloc() with a fragment size >
      PAGE_SIZE.

      In low-memory conditions, __page_frag_cache_refill() may fail the
      order-3 cache allocation and fall back to order 0; in this case the
      cache will be smaller than the fragment, causing memory corruption.
      
      Prevent this from happening by checking if the newly allocated cache is
      large enough for the fragment; if not, the allocation will fail and
      page_frag_alloc() will return NULL.
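
      A hedged sketch of the check (its placement inside page_frag_alloc() is
      approximate):

        /* If the refill fell back to a single order-0 page and the requested
         * fragment is larger than that, fail the allocation instead of
         * handing out memory that overlaps the next user's fragment. */
        if (unlikely(fragsz > PAGE_SIZE))
                return NULL;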
      
      Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
      
      
      Fixes: b63ae8ca ("mm/net: Rename and move page fragment handling from net/ to mm/")
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Cc: Chen Lin <chen45464546@163.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dac22531
    • mm: bring back update_mmu_cache() to finish_fault() · 70427f6e
      Sergei Antonov authored
      Running this test program on ARMv4 a few times (sometimes just once)
      reproduces the bug.
      
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>

      /* Arbitrary test size; the original report does not define SIZE. */
      #define SIZE (1024 * 1024)

      int main()
      {
              unsigned i;
              char paragon[SIZE];
              void* ptr;

              memset(paragon, 0xAA, SIZE);
              ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_SHARED, -1, 0);
              if (ptr == MAP_FAILED) return 1;
              printf("ptr = %p\n", ptr);
              for (i=0;i<10000;i++){
                      memset(ptr, 0xAA, SIZE);
                      if (memcmp(ptr, paragon, SIZE)) {
                              printf("Unexpected bytes on iteration %u!!!\n", i);
                              break;
                      }
              }
              munmap(ptr, SIZE);
              return 0;
      }
      
      In the "ptr" buffer there appear runs of zero bytes which are aligned
      by 16 and their lengths are multiple of 16.
      
      Linux v5.11 does not have the bug, "git bisect" finds the first bad commit:
      f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      
      Before the commit update_mmu_cache() was called during a call to
      filemap_map_pages() as well as finish_fault(). After the commit
      finish_fault() lacks it.
      
      Bring update_mmu_cache() back to finish_fault() to fix the bug.  Also
      call update_mmu_tlb() only when returning VM_FAULT_NOPAGE, to more
      closely reproduce the code of the alloc_set_pte() function that existed
      before the commit.
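
      A hedged sketch of the restored tail of finish_fault(); the helper
      names around the fix (vmf_pte_changed(), do_set_pte()) are assumptions
      about the surrounding code:

        if (likely(!vmf_pte_changed(vmf))) {
                do_set_pte(vmf, page, vmf->address);
                /* Let the arch sync its MMU cache for the new pte. */
                update_mmu_cache(vma, vmf->address, vmf->pte);
                ret = 0;
        } else {
                /* Stale pte: only poke the TLB, as alloc_set_pte() did. */
                update_mmu_tlb(vma, vmf->address, vmf->pte);
                ret = VM_FAULT_NOPAGE;
        }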
      
      On many platforms update_mmu_cache() is a no-op:
       x86, see arch/x86/include/asm/pgtable
       ARMv6+, see arch/arm/include/asm/tlbflush.h
      So, it seems, few users ran into this bug.
      
      Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com
      
      
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      70427f6e
    • frontswap: don't call ->init if no ops are registered · 37dcc673
      Christoph Hellwig authored
      If no frontswap module (i.e. zswap) was registered, frontswap_ops will
      be NULL.  In that situation, swapon crashes with the following stack
      trace:
      
        Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
        Mem abort info:
          ESR = 0x0000000096000004
          EC = 0x25: DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
          FSC = 0x04: level 0 translation fault
        Data abort info:
          ISV = 0, ISS = 0x00000004
          CM = 0, WnR = 0
        user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a4fab000
        [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
        Internal error: Oops: 96000004 [#1] SMP
        Modules linked in: zram fsl_dpaa2_eth pcs_lynx phylink ahci_qoriq crct10dif_ce ghash_ce sbsa_gwdt fsl_mc_dpio nvme lm90 nvme_core at803x xhci_plat_hcd rtc_fsl_ftm_alarm xgmac_mdio ahci_platform i2c_imx ip6_tables ip_tables fuse
        Unloaded tainted modules: cppc_cpufreq():1
        CPU: 10 PID: 761 Comm: swapon Not tainted 6.0.0-rc2-00454-g22100432cf14 #1
        Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022
        pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        pc : frontswap_init+0x38/0x60
        lr : __do_sys_swapon+0x8a8/0x9f4
        sp : ffff80000969bcf0
        x29: ffff80000969bcf0 x28: ffff37bee0d8fc00 x27: ffff80000a7f5000
        x26: fffffcdefb971e80 x25: ffffaba797453b90 x24: 0000000000000064
        x23: ffff37c1f209d1a8 x22: ffff37bee880e000 x21: ffffaba797748560
        x20: ffff37bee0d8fce4 x19: ffffaba797748488 x18: 0000000000000014
        x17: 0000000030ec029a x16: ffffaba795a479b0 x15: 0000000000000000
        x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000001
        x11: ffff37c63c0aba18 x10: 0000000000000000 x9 : ffffaba7956b8c88
        x8 : ffff80000969bcd0 x7 : 0000000000000000 x6 : 0000000000000000
        x5 : 0000000000000001 x4 : 0000000000000000 x3 : ffffaba79730f000
        x2 : ffff37bee0d8fc00 x1 : 0000000000000000 x0 : 0000000000000000
        Call trace:
        frontswap_init+0x38/0x60
        __do_sys_swapon+0x8a8/0x9f4
        __arm64_sys_swapon+0x28/0x3c
        invoke_syscall+0x78/0x100
        el0_svc_common.constprop.0+0xd4/0xf4
        do_el0_svc+0x38/0x4c
        el0_svc+0x34/0x10c
        el0t_64_sync_handler+0x11c/0x150
        el0t_64_sync+0x190/0x194
        Code: d000e283 910003fd f9006c41 f946d461 (f9400021)
        ---[ end trace 0000000000000000 ]---
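
      A hedged sketch of the guard; its exact placement inside
      frontswap_init() is an assumption:

        /* Don't dereference frontswap_ops when no backend (e.g. zswap)
         * ever registered. */
        if (!frontswap_ops)
                return;
        frontswap_ops->init(type);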
      
      Link: https://lkml.kernel.org/r/20220909130829.3262926-1-hch@lst.de
      
      
      Fixes: 1da0d94a ("frontswap: remove support for multiple ops")
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37dcc673
    • mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() · 2b7aa91b
      Naoya Horiguchi authored
      A NULL pointer dereference is triggered when invoking a THP split via
      debugfs on a system with offlined memory blocks.  With the debug option
      enabled, the following kernel messages are printed:
      
        page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
        flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
        raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
        page dumped because: unmovable page
        page:000000007d7ab72e is uninitialized and poisoned
        page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
        ------------[ cut here ]------------
        kernel BUG at include/linux/mm.h:1248!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
        ...
        RIP: 0010:split_huge_pages_write+0xcf4/0xe30
      
      This shows that page_to_nid() in page_zone() is unexpectedly called on
      an offlined memmap.

      Use pfn_to_online_page() to get the struct page in the PFN walker.
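
      A minimal sketch of the PFN-walker change (loop simplified):

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /* Skip offline/uninitialized memmap instead of touching it. */
                struct page *page = pfn_to_online_page(pfn);

                if (!page)
                        continue;
                /* ... existing per-page checks and the actual split ... */
        }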
      
      Link: https://lkml.kernel.org/r/20220908041150.3430269-1-naoya.horiguchi@linux.dev
      
      
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e8]
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Co-developed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b7aa91b
    • mm: fix madivse_pageout mishandling on non-LRU page · 58d426a7
      Minchan Kim authored
      MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from
      isolate_lru_page below.
      
      Fix it by checking PageLRU in advance.
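
      A minimal sketch of the early check (its placement in the pte walk is
      an assumption):

        /* Only LRU pages are candidates for pageout; bail out before
         * isolate_lru_page() warns about e.g. tail or other non-LRU pages. */
        if (!PageLRU(page))
                continue;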
      
      ------------[ cut here ]------------
      trying to isolate tail page
      WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140
      Modules linked in:
      CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:isolate_lru_page+0x130/0x140
      
      Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/
      Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org
      
      
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: 韩天ç`• <hantianshuo@iie.ac.cn>
      Suggested-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      58d426a7
    • powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush · bedf0341
      Yang Shi authored
      The IPI broadcast is used to serialize against fast-GUP, but fast-GUP
      will move to using RCU instead of disabling local interrupts.  Using an
      IPI is the old-style way of serializing against fast-GUP, although it
      still works as expected now.

      Fast-GUP now fixes the potential race with THP collapse by checking
      whether the PMD has changed.  So the IPI broadcast in the radix pmd
      collapse flush is not necessary anymore.  But it is still needed for
      the hash TLB.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com
      
      
      Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bedf0341
    • mm: gup: fix the fast GUP race against THP collapse · 70cbc3cc
      Yang Shi authored
      Since general RCU GUP-fast was introduced in commit 2667f50e ("mm:
      introduce a general RCU get_user_pages_fast()"), a TLB flush is no
      longer sufficient to handle concurrent GUP-fast in all cases; it only
      handles traditional IPI-based GUP-fast correctly.  On architectures
      that send an IPI broadcast on TLB flush, it works as expected.  But on
      architectures that do not use an IPI to broadcast the TLB flush, the
      following race can occur:
      
         CPU A                                          CPU B
      THP collapse                                     fast GUP
                                                    gup_pmd_range() <-- see valid pmd
                                                        gup_pte_range() <-- work on pte
      pmdp_collapse_flush() <-- clear pmd and flush
      __collapse_huge_page_isolate()
          check page pinned <-- before GUP bump refcount
                                                            pin the page
                                                            check PTE <-- no change
      __collapse_huge_page_copy()
          copy data to huge page
          ptep_clear()
      install huge pmd for the huge page
                                                            return the stale page
      discard the stale page
      
      The race can be fixed by checking whether the PMD has changed after
      taking the page pin in fast GUP, just like what is done for the PTE.
      If the PMD has changed, there may be a parallel THP collapse in
      progress, so GUP should back off.
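
      A hedged sketch of the recheck in the pte walk of fast GUP; the folio
      helpers are assumptions about the surrounding code:

        /* After pinning, verify the pmd we started from is still installed;
         * if THP collapse changed it, drop the pin and back off. */
        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
                gup_put_folio(folio, 1, flags);
                goto pte_unmap;
        }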
      
      Also update the stale comment about serializing against fast GUP in
      khugepaged.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
      
      
      Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()")
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      70cbc3cc
  2. Sep 12, 2022