  1. Jul 01, 2021
    • userfaultfd/shmem: support UFFDIO_CONTINUE for shmem · 15313257
      Axel Rasmussen authored
      
      
      With this change, userspace can resolve a minor fault within a
      shmem-backed area with a UFFDIO_CONTINUE ioctl.  The semantics for this
      match those for hugetlbfs - we look up the existing page in the page
      cache, and install a PTE for it.
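
      From userspace, resolving such a fault might look like the sketch
      below.  It assumes the uffd was already registered in minor mode over a
      shmem-backed range and that the page contents were populated through a
      second mapping of the same file; resolve_minor_fault() is a
      hypothetical name, not something this commit adds.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: resolve a minor fault on [addr, addr + len) by
 * asking the kernel to map the pages already present in the page cache.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int resolve_minor_fault(int uffd, uint64_t addr, uint64_t len)
{
#ifdef UFFDIO_CONTINUE
	struct uffdio_continue cont;

	memset(&cont, 0, sizeof(cont));
	cont.range.start = addr;  /* must be page aligned */
	cont.range.len = len;     /* must be a multiple of the page size */
	cont.mode = 0;            /* 0: wake the faulting thread when done */

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
#else
	/* Assumption: headers predate UFFDIO_CONTINUE, so report ENOSYS. */
	(void)uffd;
	(void)addr;
	(void)len;
	errno = ENOSYS;
	return -1;
#endif
}
```

      Unlike UFFDIO_COPY, no source buffer is passed: the kernel only
      installs a PTE for the page that is already in the page cache.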
      
      This commit introduces a new helper: mfill_atomic_install_pte.
      
      Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
      shmem.c?  The existing userfault implementation only relies on shmem.c for
      VM_SHARED VMAs.  However, minor fault handling / CONTINUE work just fine
      for !VM_SHARED VMAs as well.  We'd prefer to handle CONTINUE for shmem in
      one place, regardless of shared/private (to reduce code duplication).
      
      Why add a new mfill_atomic_install_pte helper?  A problem we have with
      CONTINUE is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are
      *close* to what we want, but not exactly.  We do want to set up the PTEs
      in a CONTINUE operation, but we don't want to e.g. allocate a new page,
      charge it (e.g. to the shmem inode), manipulate various flags, etc.  Also
      we have the problem stated above: shmem_mfill_atomic_pte() and
      mcopy_atomic_pte() each handle only one half of the problem (shared /
      private) that CONTINUE cares about.  So, introduce mcontinue_atomic_pte()
      to handle all of the shmem CONTINUE cases, and introduce the
      mfill_atomic_install_pte helper so that it doesn't duplicate code with
      mcopy_atomic_pte().
      
      In a future commit, shmem_mfill_atomic_pte() will also be modified to use
      this new helper.  However, since this is a bigger refactor, it seems most
      clear to do it as a separate change.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/shmem: support minor fault registration for shmem · c949b097
      Axel Rasmussen authored
      
      
      This patch allows shmem-backed VMAs to be registered for minor faults.
      Minor faults are appropriately relayed to userspace in the fault path, for
      VMAs with the relevant flag.
      
      This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
      minor faults, though, so userspace doesn't yet have a way to resolve such
      faults.
      
      Because of this, we also don't yet advertise this as a supported feature.
      That will be done in a separate commit when the feature is fully
      implemented.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte · 3460f6e5
      Axel Rasmussen authored
      
      
      Patch series "userfaultfd: add minor fault handling for shmem", v6.
      
      Overview
      ========
      
      See the series which added minor faults for hugetlbfs [3] for a detailed
      overview of minor fault handling in general.  This series adds the same
      support for shmem-backed areas.
      
      This series is structured as follows:
      
      - Commits 1 and 2 are cleanups.
      - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
      - Commit 5 advertises that the feature is now available since at this point it's
        fully implemented.
      - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
        helper we've introduced.
      - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.
      
      Use Case
      ========
      
      In some cases it is useful to have VM memory backed by tmpfs instead of
      hugetlbfs.  So, this feature will be used to support the same VM live
      migration use case described in my original series.
      
      Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope
      to optimize the Android Runtime garbage collector using this feature:
      
      "The plan is to use userfaultfd for concurrently compacting the heap.
      With this feature, the heap can be shared-mapped at another location where
      the GC-thread(s) could continue the compaction operation without the need
      to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
      threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
      execution.  Furthermore, this feature enables updating references in the
      'non-moving' portion of the heap efficiently.  Without this feature,
      unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."
      
      [1] https://lore.kernel.org/patchwork/cover/1388144/
      [2] https://lore.kernel.org/patchwork/patch/1408161/
      [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t
      
      This patch (of 9):
      
      Previously, we did a dance where we had one calling path in userfaultfd.c
      (mfill_atomic_pte), but then we split it into two in shmem_fs.h
      (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
      shared function in shmem.c (shmem_mfill_atomic_pte).
      
      This is all a bit overly complex.  Just call the single combined shmem
      function directly, allowing us to clean up various branches, boilerplate,
      etc.
      
      While we're touching this function, two other small cleanup changes:
      - offset is equivalent to pgoff, so we can get rid of offset entirely.
      - Split two VM_BUG_ON cases into two statements. This means the line
        number reported when the BUG is hit specifies exactly which condition
        was true.
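
      The benefit of splitting the checks can be seen in a small userspace
      model.  MODEL_BUG_ON() below is a hypothetical stand-in for VM_BUG_ON()
      that records the source line instead of crashing; it is only an
      illustration, not the kernel macro.

```c
#include <stdbool.h>

/* Line recorded by the most recent failing check; 0 means none fired. */
static int last_bug_line;

/* Hypothetical stand-in for VM_BUG_ON(): remember the line, don't crash. */
#define MODEL_BUG_ON(cond)                        \
	do {                                      \
		if (cond)                         \
			last_bug_line = __LINE__; \
	} while (0)

/* Combined form: the same line is reported whichever condition fired. */
static int check_combined(bool a, bool b)
{
	last_bug_line = 0;
	MODEL_BUG_ON(a || b);
	return last_bug_line;
}

/* Split form: the reported line identifies the failing condition. */
static int check_split(bool a, bool b)
{
	last_bug_line = 0;
	MODEL_BUG_ON(a);
	if (!last_bug_line)  /* a real BUG would have stopped here already */
		MODEL_BUG_ON(b);
	return last_bug_line;
}
```

      With the combined form both failure cases report one line; with the
      split form each case reports its own line.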
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: add pagemap uffd-wp test · eb3b2e00
      Peter Xu authored
      
      
      Add one anonymous-memory-specific test to start using pagemap.  With
      pagemap support, we can directly read the uffd-wp bit from the pgtable
      without triggering any fault, so it's easier to do sanity checks in unit
      tests.

      Meanwhile this test also leverages the newly introduced MADV_PAGEOUT
      madvise function to test swap ptes with the uffd-wp bit set, and across
      fork()s.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pagemap: export uffd-wp protection information · fb8e37f3
      Peter Xu authored
      
      
      Export the PTE/PMD status of uffd-wp to pagemap too.
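
      A userspace reader might consume the exported bit roughly as follows.
      This is a sketch: the bit positions follow the kernel's pagemap
      documentation (bit 57 for uffd-wp), while pagemap_read() and
      pagemap_uffd_wp() are hypothetical helper names.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Bit positions of a 64-bit pagemap entry, per the pagemap documentation. */
#define PM_UFFD_WP (1ULL << 57)  /* pte is uffd-wp write-protected */
#define PM_SWAP    (1ULL << 62)  /* page is swapped out */
#define PM_PRESENT (1ULL << 63)  /* page is present in RAM */

/* True if the entry has the uffd-wp bit set. */
static bool pagemap_uffd_wp(uint64_t entry)
{
	return (entry & PM_UFFD_WP) != 0;
}

/*
 * Hypothetical helper: read the pagemap entry covering one virtual
 * address of the calling process.  Returns 0 on success, -1 on failure.
 */
static int pagemap_read(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	/* One 8-byte entry per virtual page, indexed by page number. */
	offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
		 (off_t)sizeof(uint64_t);

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	n = read(fd, entry, sizeof(*entry));
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

      Reading the entry does not touch the page itself, so the check does
      not trigger a fault on the address being inspected.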
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: fail uffd-wp registration if not supported · 00b151f2
      Peter Xu authored
      
      
      We should fail uffd-wp registration immediately if the arch does not even
      have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined.  That also blocks the
      relevant ioctls, e.g. UFFDIO_WRITEPROTECT, because they check against
      VM_UFFD_WP, which can only be set by a successful registration.
      
      Remove the WP feature bit too for those archs when handling UFFDIO_API
      ioctl.
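
      On the userspace side, this means the feature bits returned by
      UFFDIO_API can be consulted before asking for write-protect mode.  A
      minimal sketch, assuming the caller already holds the features word
      from UFFDIO_API; pick_register_mode() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: choose UFFDIO_REGISTER modes from the feature bits
 * returned by UFFDIO_API.  Write-protect mode is requested only when the
 * kernel advertised UFFD_FEATURE_PAGEFAULT_FLAG_WP.
 */
static uint64_t pick_register_mode(uint64_t api_features, bool want_wp)
{
	uint64_t mode = UFFDIO_REGISTER_MODE_MISSING;

	if (want_wp && (api_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP))
		mode |= UFFDIO_REGISTER_MODE_WP;
	return mode;
}
```

      A caller that skips this check and requests UFFDIO_REGISTER_MODE_WP
      on an unsupported arch should expect the registration itself to fail.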
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
      Peter Xu authored
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it did not get
      everything right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately.
      
      3. We forgot to carry over the uffd-wp bit for a write migration huge
         pmd entry.  This also happens in copy_huge_pmd(), where we convert a
         write huge migration entry into a read one.

      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

      5. In copy_present_page(), when COW is enforced during fork(), we also
         need to carry over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma and the pte to be copied has the uffd-wp bit set.
      
      Remove the comment in copy_present_pte() about this.  Commenting only
      there won't help much, while commenting everywhere would be overkill.
      Let's assume the commit messages will help.
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: simplify copying of huge zero page pmd when fork · 5fc7a5f6
      Peter Xu authored
      
      
      Patch series "mm/uffd: Misc fix for uffd-wp and one more test".
      
      This series tries to fix some corner case bugs for uffd-wp on either thp
      or fork().  It then introduces a new test with pagemap/pageout.
      
      Patch layout:
      
      Patch 1:    cleanup for THP, it'll slightly simplify the follow up patches
      Patch 2-4:  misc fixes for uffd-wp here and there; please refer to each patch
      Patch 5:    add pagemap support for uffd-wp
      Patch 6:    add pagemap/pageout test for uffd-wp
      
      The last test introduced can also verify some of the fixes in previous
      patches, as the test will fail without the fixes.  However it's not easy
      to verify all the changes in patch 2-4, but hopefully they can still be
      properly reviewed.
      
      Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch
      5 will be incomplete as it's missing e.g. the hugetlbfs part or the special
      swap pte detection.  However that's not needed in this series, and since
      that series is still under review, this series does not depend on that
      one (the last test only runs with anonymous memory, not file-backed).  So
      this series can be merged even before that series.
      
      This patch (of 6):
      
      The huge zero page is handled in a special path in copy_huge_pmd(), however
      it should share most of its code with a normal thp page.  Try to share more
      code by removing the special path.  The only leftover so far is the
      huge zero page refcounting (mm_get_huge_zero_page()), because that's
      done separately with a global counter.

      This prepares for a future patch that modifies the huge pmd to be
      installed, so that we don't need to duplicate that change explicitly in
      the huge zero page case too.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: unify error handling · 42e584ee
      Peter Xu authored
      
      
      Introduce err()/_err() and replace all the different ways to fail the
      program, mostly "fprintf" and "perror" with tons of exit() calls.  Always
      stop the test program at any failure.
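
      The shape of such macros can be sketched as below.  This is an
      illustration under assumptions rather than the selftest's exact code:
      the message format and the parse_page_count() caller are hypothetical,
      and ##__VA_ARGS__ relies on a GNU C extension.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Report a failure together with errno and the source line; no exit. */
#define _err(fmt, ...)                                         \
	do {                                                   \
		int saved_errno = errno;                       \
		fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__); \
		fprintf(stderr, " (errno=%d, line=%d)\n",      \
			saved_errno, __LINE__);                \
	} while (0)

/* Report a failure and stop the program immediately. */
#define err(fmt, ...)                     \
	do {                              \
		_err(fmt, ##__VA_ARGS__); \
		exit(1);                  \
	} while (0)

/* Hypothetical caller: one err() replaces fprintf()/perror() + exit(). */
static long parse_page_count(const char *arg)
{
	char *end;
	long n;

	errno = 0;
	n = strtol(arg, &end, 10);
	if (errno || end == arg || *end != '\0' || n <= 0)
		err("invalid page count '%s'", arg);
	return n;
}
```

      Because err() never returns, callers need no error-propagation code
      after the check.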
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: only dump counts if mode enabled · de3ca8e4
      Peter Xu authored
      
      
      WP and MINOR modes are conditionally enabled on specific memory types.
      This patch avoids dumping tons of zeros for those cases when the modes are
      not supported at all.
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: dropping VERIFY check in locking_thread · 4e08e18a
      Peter Xu authored
      
      
      The VERIFY check compares the page against all zeros and loops quite a few
      times.  However, after that we verify the same page with count_verify,
      and count_verify can never be zero.  So if it's a zero page, we'll detect
      it anyway with the code below.

      There's yet another place where we conditionally check the fault flag -
      just do it unconditionally.
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: remove the time() check on delayed uffd · ba4f8c35
      Peter Xu authored
      
      
      There seems to be no guarantee that time() will return the same value for
      the two calls even if there's no delay, e.g. when a fault happens to
      cross a second boundary.  Meanwhile, this message is also not that
      helpful, since a delay can have many causes, e.g. scheduling latency of
      the resolving thread.  It may not mean an issue with uffd.

      Nor have I seen this error triggered in past runs.  Even if it
      triggers, it'll be drowned out by the rest of the test logs.  Remove it.
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: use user mode only · d2c6c06f
      Peter Xu authored
      
      
      Patch series "userfaultfd/selftests: A few cleanups", v2.
      
      I have wanted to clean up userfaultfd.c fault handling for a long time.  If
      it's not cleaned up, then as new code grows the file, the amount that
      needs cleaning grows too...  This is my attempt to clean up fault
      handling in the userfaultfd selftest, using an err() macro instead of
      either fprintf() or perror() followed by another exit() call.
      
      The huge cleanup is done in the last patch.  The first 4 patches are some
      other standalone cleanups for the same file, so I put them together.
      
      This patch (of 5):
      
      The userfaultfd selftest does not need to handle kernel-initiated faults.
      Set user mode so it can be run even if unprivileged_userfaultfd=0 (which
      is the default).
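
      Requesting user mode at creation time could look like the sketch
      below.  open_uffd_user_mode() is a hypothetical helper, and the
      fallback definition of UFFD_USER_MODE_ONLY is an assumption for
      building against headers that predate the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1  /* assumption: fallback for older headers */
#endif

/*
 * Hypothetical helper: create a userfaultfd that only handles faults
 * raised from user mode.  Returns the fd, or -1 with errno set.
 */
static int open_uffd_user_mode(void)
{
#ifdef __NR_userfaultfd
	return (int)syscall(__NR_userfaultfd,
			    O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

      On kernels that do not know the flag, or in sandboxes that filter the
      syscall, the call fails and errno describes why.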
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: disable pcp for page_handle_poison() · 510d25c9
      Naoya Horiguchi authored
      
      
      Recent changes by the patch "mm/page_alloc: allow high-order pages to be
      stored on the per-cpu lists" make the kernel decide whether to use pcp
      via pcp_allowed_order(), which breaks soft-offline for hugetlb pages.

      Soft-offline dissolves a migration source page, then removes it from the
      buddy free list, so it's assumed that any subpage of the soft-offlined
      hugepage is recognized as a buddy page just after returning from
      dissolve_free_huge_page().  pcp_allowed_order() returns true for hugetlb,
      so this assumption is no longer true.

      So disable pcp during dissolve_free_huge_page() and take_page_off_buddy()
      to prevent soft-offlined hugepages from being linked to pcp lists.
      Soft-offline should not be a common event, so the impact on performance
      should be minimal.  And I think that the optimization of Mel's patch could
      benefit hugetlb, so zone_pcp_disable() is called only in hwpoison
      context.
      
      Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: address ref count racing in prep_compound_gigantic_page · 7118fc29
      Mike Kravetz authored
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
      if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
      that the extra ref count will be dropped during an RCU grace period.  This
      is not a performance critical code path and the wait should be
      acceptable.  If the ref count is still inflated after the grace period,
      then undo any modifications made and return an error.
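
      The cmpxchg step and the undo path can be sketched in userspace C.  This is
      an illustration only, not the kernel code: struct toy_page,
      toy_ref_freeze() and toy_prep_gigantic() are made-up names, C11 atomics
      stand in for the kernel's page ref count helpers, and the
      synchronize_rcu() wait and retry are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the reference count is modelled. */
struct toy_page {
	atomic_int refcount;
};

/* Set the count to zero only if it is exactly one (a cmpxchg). */
static bool toy_ref_freeze(struct toy_page *p)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/*
 * Freeze the tail pages pages[1..nr-1].  If any tail page carries an
 * extra reference, restore the ones already frozen and report failure.
 * The patch described above waits for an RCU grace period and checks
 * again before giving up; that wait is omitted in this sketch.
 */
static int toy_prep_gigantic(struct toy_page *pages, size_t nr)
{
	assert(nr >= 1);

	for (size_t i = 1; i < nr; i++) {
		if (!toy_ref_freeze(&pages[i])) {
			/* Undo: give the frozen tail pages their count back. */
			for (size_t j = 1; j < i; j++)
				atomic_store(&pages[j].refcount, 1);
			return -1;
		}
	}
	return 0;
}
```

      With every ref count at one, toy_prep_gigantic() leaves the head page at
      one and the tail pages at zero.  If one tail page holds an extra
      reference, the call fails and the tail pages frozen so far are back at one.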
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7118fc29
    • Mike Kravetz's avatar
      hugetlb: remove prep_compound_huge_page cleanup · 48b8d744
      Mike Kravetz authored
      
      
      Patch series "Fix prep_compound_gigantic_page ref count adjustment".
      
      These patches address the possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative as described
      by Jann Horn in [1].
      
      The first patch simply removes the unnecessary/obsolete helper routine
      prep_compound_huge_page to make the actual fix a little simpler.
      
      The second patch is the actual fix and has a detailed explanation in the
      commit message.
      
      This potential issue has existed for almost 10 years and I am unaware of
      anyone actually hitting the race.  I did not cc stable, but would be happy
      to squash the patches and send to stable if anyone thinks that is a good
      idea.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      This patch (of 2):
      
      I could not think of a reliable way to recreate the issue for testing.
      Rather, I 'simulated errors' to exercise all the error paths.
      
      The routine prep_compound_huge_page is a simple wrapper to call either
      prep_compound_gigantic_page or prep_compound_page.  However, it is only
      called from gather_bootmem_prealloc which only processes gigantic pages.
      Eliminate the routine and call prep_compound_gigantic_page directly.
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48b8d744
    • Muchun Song's avatar
      mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON · e6d41f12
      Muchun Song authored
      
      
      When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing of unused vmemmap pages
      associated with each HugeTLB page is off by default.  Now the vmemmap is PMD
      mapped, so there is no side effect when this feature is enabled with no
      HugeTLB pages in the system.  Someone may want to enable this feature at
      compile time instead of using the boot command line.  So add a config option
      to make it default on for those who do not want to enable it via the command
      line.
      
      Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6d41f12
    • Muchun Song's avatar
      mm: sparsemem: use huge PMD mapping for vmemmap pages · 2d7a2171
      Muchun Song authored
      
      
      The preparation of splitting huge PMD mapping of vmemmap pages is ready,
      so switch the mapping from PTE to PMD.
      
      Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d7a2171
    • Muchun Song's avatar
      mm: sparsemem: split the huge PMD mapping of vmemmap pages · 3bc2b6a7
      Muchun Song authored
      
      
      Patch series "Split huge PMD mapping of vmemmap pages", v4.
      
      In order to reduce the difficulty of code review in series [1], we disabled
      huge PMD mapping of vmemmap pages when that feature was enabled.  In this
      series, we do not disable huge PMD mapping of vmemmap pages anymore.  We
      will split huge PMD mapping when needed.  When HugeTLB pages are freed
      from the pool we do not attempt to coalesce and move back to a PMD mapping
      because it is much more complex.
      
      [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/
      
      This patch (of 3):
      
      In [1], PMD mappings of vmemmap pages were disabled if the feature
      hugetlb_free_vmemmap was enabled.  This was done to simplify the initial
      implementation of vmemmap freeing for hugetlb pages.  Now, remove this
      simplification by allowing PMD mapping and switching to PTE mappings as
      needed for allocated hugetlb pages.
      
      When a hugetlb page is allocated, the vmemmap page tables are walked to
      free vmemmap pages.  During this walk, split huge PMD mappings to PTE
      mappings as required.  In the unlikely case PTE pages can not be
      allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb
      page.
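
      The split step can be pictured with a small userspace model.  This is a
      sketch under assumptions, not the kernel page table code: struct toy_pmd,
      toy_split_pmd() and the value 512 entries per table are invented for
      illustration, and malloc() stands in for the PTE page allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_PTRS_PER_PTE 512	/* assumed: 2 MiB PMD over 4 KiB pages */

/* Toy PMD entry: either one huge leaf or a pointer to a PTE table. */
struct toy_pmd {
	bool huge;
	unsigned long pfn;	/* first page frame covered by the entry */
	unsigned long *ptes;	/* per-page frames once the entry is split */
};

/*
 * Replace a huge leaf by a table of base-page entries that map the
 * same frames.  On allocation failure the entry is left untouched and
 * -ENOMEM is returned, so the caller can skip the optimization.
 */
static int toy_split_pmd(struct toy_pmd *pmd)
{
	unsigned long *ptes;

	assert(pmd != NULL);

	if (!pmd->huge)
		return 0;	/* already PTE mapped, nothing to do */

	ptes = malloc(TOY_PTRS_PER_PTE * sizeof(*ptes));
	if (!ptes)
		return -ENOMEM;

	for (unsigned long i = 0; i < TOY_PTRS_PER_PTE; i++)
		ptes[i] = pmd->pfn + i;

	pmd->ptes = ptes;
	pmd->huge = false;
	return 0;
}
```

      The point of the model is the failure path: when the table cannot be
      allocated, the huge mapping stays as it was and only the vmemmap
      optimization for that hugetlb page is skipped.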
      
      When HugeTLB pages are freed from the pool, we do not attempt to
      coalesce and move back to a PMD mapping because it is much more complex.
      
      [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bc2b6a7
    • Mina Almasry's avatar
      mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
      Mina Almasry authored
      
      
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller does the copy outside
      of the lock, we again check the cache, and allocate a page consuming the
      reservation, and copy over the contents.
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 \
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10 \
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
      Both tests succeed and produce no warnings.  After the
      test runs, the number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
      Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc5fcbb
    • Nanyong Sun's avatar
      khugepaged: selftests: remove debug_cow · 22f3c951
      Nanyong Sun authored
      The debug_cow attribute was removed in commit 4958e4d8 ("mm:
      thp: remove debug_cow switch"), so remove it from the selftest code too;
      otherwise the khugepaged test will fail.
      
      Link: https://lkml.kernel.org/r/20210430051117.400189-1-sunnanyong@huawei.com
      Fixes: 4958e4d8 ("mm: thp: remove debug_cow switch")
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22f3c951
    • Christophe Leroy's avatar
      powerpc/8xx: add support for huge pages on VMAP and VMALLOC · a6a8f7c4
      Christophe Leroy authored
      
      
      powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      For the time being, vmalloc and vmap only support huge pages which are leaf
      at PMD level.
      
      Here the PMD level is 4M, which doesn't correspond to any supported page
      size.
      
      For now, implement use of 16k and 512k pages, which is done at PTE level.
      
      Support of 8M pages will be implemented later, as it requires vmalloc to
      support hugepd tables.
      
      Link: https://lkml.kernel.org/r/8b972f1c03fb6bd59953035f0a3e4d26659de4f8.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6a8f7c4
    • Christophe Leroy's avatar
      mm/vmalloc: enable mapping of huge pages at pte level in vmalloc · 3382bbee
      Christophe Leroy authored
      
      
      On some architectures like powerpc, there are huge pages that are mapped
      at pte level.
      
      Enable it in vmalloc.
      
      For that, architectures can provide arch_vmap_pte_supported_shift() that
      returns the shift for pages to map at pte level.
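
      As a rough illustration of what such a hook might compute on a machine
      with 16k and 512k PTE-level pages, here is a userspace sketch.  The
      function name toy_vmap_pte_supported_shift(), the TOY_* constants and the
      4k base page are assumptions for the example, not the powerpc code.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4k base pages */
#define TOY_SHIFT_16K	14
#define TOY_SHIFT_512K	19

/* Pick the largest PTE-level page shift that fits the area size. */
static int toy_vmap_pte_supported_shift(unsigned long size)
{
	assert(size > 0);

	if (size >= (1UL << TOY_SHIFT_512K))
		return TOY_SHIFT_512K;
	if (size >= (1UL << TOY_SHIFT_16K))
		return TOY_SHIFT_16K;
	return TOY_PAGE_SHIFT;	/* what a default stub would return */
}
```

      In this model a 100k area gets 16k pages and anything of 512k or more
      gets 512k pages; smaller areas fall back to the base page shift.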
      
      Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3382bbee
    • Christophe Leroy's avatar
      mm/vmalloc: enable mapping of huge pages at pte level in vmap · f7ee1f13
      Christophe Leroy authored
      
      
      On some architectures like powerpc, there are huge pages that are mapped
      at pte level.
      
      Enable it in vmap.
      
      For that, architectures can provide arch_vmap_pte_range_map_size() that
      returns the size of pages to map at pte level.
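
      A size hook of this kind has to respect alignment as well as length.  The
      userspace sketch below shows one plausible shape; the name
      toy_vmap_pte_range_map_size(), its parameter list, the helper toy_fits()
      and the 4k base page are all assumptions made for illustration.

```c
#include <assert.h>

#define TOY_PAGE_SIZE	4096UL		/* assumed 4k base pages */
#define TOY_SZ_16K	(16UL * 1024)
#define TOY_SZ_512K	(512UL * 1024)

/* True if a page of 'size' bytes can map [addr, end) starting at 'phys'. */
static int toy_fits(unsigned long addr, unsigned long end,
		    unsigned long phys, unsigned long size)
{
	return end - addr >= size &&
	       (addr & (size - 1)) == 0 && (phys & (size - 1)) == 0;
}

/*
 * Size of the next PTE-level mapping: the largest supported page size
 * that is no bigger than max_size and for which the virtual address,
 * the physical address and the remaining length all line up.
 */
static unsigned long toy_vmap_pte_range_map_size(unsigned long addr,
						 unsigned long end,
						 unsigned long phys,
						 unsigned long max_size)
{
	assert(addr < end);

	if (max_size >= TOY_SZ_512K && toy_fits(addr, end, phys, TOY_SZ_512K))
		return TOY_SZ_512K;
	if (max_size >= TOY_SZ_16K && toy_fits(addr, end, phys, TOY_SZ_16K))
		return TOY_SZ_16K;
	return TOY_PAGE_SIZE;	/* what a default stub would return */
}
```

      A caller would advance addr and phys by the returned size and ask again,
      so a range that starts misaligned can still switch to larger pages once
      it reaches a suitable boundary.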
      
      Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ee1f13
    • Christophe Leroy's avatar
      mm/pgtable: add stubs for {pmd/pud}_{set/clear}_huge · c742199a
      Christophe Leroy authored
      
      
      For architectures with no PMD and/or no PUD, add stubs similar to what we
      have for architectures without P4D.
      
      [christophe.leroy@csgroup.eu: arm64: define only {pud/pmd}_{set/clear}_huge when useful]
        Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu
      [christophe.leroy@csgroup.eu: x86: define only {pud/pmd}_{set/clear}_huge when useful]
        Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu
      
      Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c742199a
    • Christophe Leroy's avatar
      mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
      Christophe Leroy authored
      
      
      Patch series "Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      For the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
      Here the PMD level is 4M, which doesn't correspond to any supported
      page size.
      
      For now, implement use of 16k and 512k pages, which is done
      at PTE level.
      
      Support of 8M pages will be implemented later, as it requires use of
      hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
      
      This patch (of 5):
      
      Currently, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
      vma is used to get the page shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.
      
      In order to use this function without a vma, replace vma by shift and
      flags.  Also remove the unused parameters.
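
      The resulting vma-free shape might look like the userspace mock below.
      Everything in it is a stand-in chosen for illustration: toy_pte_t and
      toy_vm_flags_t replace the kernel types, toy_make_huge_pte() is not the
      kernel function, and the two size bits are hypothetical.

```c
#include <assert.h>

typedef unsigned long toy_pte_t;	/* stand-in for pte_t */
typedef unsigned long toy_vm_flags_t;	/* stand-in for vm_flags_t */

#define TOY_PTE_SZ_512K	0x1UL	/* hypothetical "512k page" bit */
#define TOY_PTE_SZ_16K	0x2UL	/* hypothetical "16k page" bit */

/*
 * The caller passes the page shift and the flags it would otherwise
 * have read from the vma.  The flags are unused in this toy; per the
 * text above, Sparc is the architecture that needs them.
 */
static toy_pte_t toy_make_huge_pte(toy_pte_t entry, unsigned int shift,
				   toy_vm_flags_t flags)
{
	(void)flags;
	assert(shift < 8 * sizeof(unsigned long));

	if (shift == 19)
		return entry | TOY_PTE_SZ_512K;
	if (shift == 14)
		return entry | TOY_PTE_SZ_16K;
	return entry;
}
```

      Because nothing here dereferences a vma, the same helper can serve a
      caller such as vmalloc that has a shift but no vma to hand over.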
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79c1c594
    • Miaohe Lin's avatar
      mm/huge_memory.c: don't discard hugepage if other processes are mapping it · babbbdd0
      Miaohe Lin authored
      If other processes are mapping any other subpages of the hugepage, i.e.
      in pte-mapped thp case, page_mapcount() will return 1 incorrectly.  Then
      we would discard the page while other processes are still mapping it.  Fix
      it by using total_mapcount() which can tell whether other processes are
      still mapping it.
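
      A toy model shows why a per-subpage count is the wrong question here.
      It is deliberately simplified and is not how the kernel stores map
      counts: struct toy_thp and both helpers are invented names, and the huge
      page has only four subpages.

```c
#include <assert.h>

#define TOY_NR_SUBPAGES 4	/* a real PMD-sized THP has far more */

/* Toy THP: mappings of the whole page plus per-subpage PTE mappings. */
struct toy_thp {
	int pmd_mapcount;			/* whole-page (PMD) mappings */
	int pte_mapcount[TOY_NR_SUBPAGES];	/* PTE mappings per subpage */
};

/* Mappings that cover one given subpage. */
static int toy_page_mapcount(const struct toy_thp *thp, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_SUBPAGES);
	return thp->pmd_mapcount + thp->pte_mapcount[idx];
}

/* Mappings of any part of the huge page. */
static int toy_total_mapcount(const struct toy_thp *thp)
{
	int total = thp->pmd_mapcount;

	for (int i = 0; i < TOY_NR_SUBPAGES; i++)
		total += thp->pte_mapcount[i];
	return total;
}
```

      If one process maps subpage 0 by PTE and another maps subpage 2, the
      count for subpage 0 is one even though the huge page has two mappers;
      only the total reveals the second process.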
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
      Fixes: b8d3c4c3 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      babbbdd0
    • Miaohe Lin's avatar
      mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd · 9132a468
      Miaohe Lin authored
      Commit aa88b68c ("thp: keep huge zero page pinned until tlb flush")
      introduced tlb_remove_page() for huge zero page to keep it pinned until
      flush is complete and prevent the page from being split under us.  But
      huge zero page is kept pinned until all relevant mm_users reach zero since
      commit 6fcb52a5 ("thp: reduce usage of huge zero page's atomic
      counter").  So tlb_remove_page_size() for huge zero pmd is unnecessary
      now.
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9132a468
    • Miaohe Lin's avatar
      mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() · e6be37b2
      Miaohe Lin authored
      Since commit 99cb0dbd ("mm,thp: add read-only THP support for
      (non-shmem) FS"), read-only THP file mapping is supported.  But it forgot
      to add checking for it in transparent_hugepage_enabled().  To fix it, we
      add checking for read-only THP file mapping and also introduce helper
      transhuge_vma_enabled() to check whether thp is enabled for specified vma
      to reduce duplicated code.  We rename transparent_hugepage_enabled to
      transparent_hugepage_active to make the code easier to follow as suggested
      by David Hildenbrand.
      
      [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
        Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6be37b2
    • Miaohe Lin's avatar
      mm/huge_memory.c: use page->deferred_list · dfe5c51c
      Miaohe Lin authored
      
      
      Now that we can represent the location of ->deferred_list instead of
      ->mapping + ->index, make use of it to improve readability.
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe5c51c
    • Miaohe Lin's avatar
      mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK · b2bd53f1
      Miaohe Lin authored
      
      
      Patch series "Cleanup and fixup for huge_memory", v3.
      
      This series contains cleanups to remove dedicated macro and remove
      unnecessary tlb_remove_page_size() for huge zero pmd.  Also this adds
      missing read-only THP checking for transparent_hugepage_enabled() and
      avoids discarding hugepage if other processes are mapping it.  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      To simplify the code, rewrite the pgoff checking logic and remove the
      macro HPAGE_CACHE_INDEX_MASK, which is only used here.
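
      One way such a check can be written without a dedicated mask is to test
      the difference for alignment instead of comparing masked values.  The
      userspace sketch below shows that the two forms agree; the names, the 4k
      base page and the 512 pages per huge page are assumptions, and the
      actual diff is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT		12	/* assumed 4k base pages */
#define TOY_HPAGE_NR		512UL	/* assumed base pages per huge page */
#define TOY_INDEX_MASK		(TOY_HPAGE_NR - 1)

/* Mask-based form: compare the low index bits of both values. */
static bool toy_misaligned_mask(unsigned long vm_start, unsigned long pgoff)
{
	return ((vm_start >> TOY_PAGE_SHIFT) & TOY_INDEX_MASK) !=
	       (pgoff & TOY_INDEX_MASK);
}

/* Mask-free form: the difference must be a multiple of the huge page. */
static bool toy_misaligned_diff(unsigned long vm_start, unsigned long pgoff)
{
	return ((vm_start >> TOY_PAGE_SHIFT) - pgoff) % TOY_HPAGE_NR != 0;
}

/* Check that both forms give the same answer for one input. */
static void toy_check_equivalent(unsigned long vm_start, unsigned long pgoff)
{
	assert(toy_misaligned_mask(vm_start, pgoff) ==
	       toy_misaligned_diff(vm_start, pgoff));
}
```

      The equivalence relies on the huge page size being a power of two, so
      that unsigned wrap-around in the subtraction does not disturb the low
      bits.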
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210511134857.1581273-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2bd53f1
    • Shixin Liu's avatar
      mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake · b593b90d
      Shixin Liu authored
      
      
      Remove redundant pfn_{pmd/pte}() in {pmd/pte}_advanced_tests() and adjust
      pfn_pud() in pud_advanced_tests() to make it similar to the other two
      functions.
      
      In addition, the branch condition should be CONFIG_TRANSPARENT_HUGEPAGE
      instead of CONFIG_ARCH_HAS_PTE_DEVMAP.
      
      Link: https://lkml.kernel.org/r/20210419071820.750217-2-liushixin2@huawei.com
      Signed-off-by: Shixin Liu <liushixin2@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b593b90d
    • Shixin Liu's avatar
      mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE · 5fe77be6
      Shixin Liu authored
      
      
      The functions {pmd/pud}_set_huge and {pmd/pud}_clear_huge are not
      dependent on THP.  Hence move {pmd/pud}_huge_tests out of
      CONFIG_TRANSPARENT_HUGEPAGE.
      
      Link: https://lkml.kernel.org/r/20210419071820.750217-1-liushixin2@huawei.com
      Signed-off-by: Shixin Liu <liushixin2@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fe77be6
    • Muchun Song's avatar
      mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate · 77490587
      Muchun Song authored
      
      
      All the infrastructure is ready, so we introduce the nr_free_vmemmap_pages
      field in the hstate to indicate how many vmemmap pages associated with a
      HugeTLB page can be freed to the buddy allocator, and initialize it in
      hugetlb_vmemmap_init().  This patch is the actual enablement of the
      feature.
      
      Only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs can
      be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled, so add a
      BUILD_BUG_ON to catch invalid usage of the tail struct page.
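
      As a rough illustration of the bound described above, the sketch below
      computes how many struct page entries fit in the reserved vmemmap area
      and checks a tail-page count against it at compile time.  The 4 KiB base
      page, the 64-byte struct page, the two reserved pages and every MODEL_
      name are assumptions made for illustration, not values taken from this
      patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only (x86-64-like configuration). */
#define MODEL_PAGE_SIZE         4096u  /* base page size in bytes */
#define MODEL_STRUCT_PAGE_SIZE  64u    /* stand-in for sizeof(struct page) */
#define MODEL_RESERVE_PAGES     2u     /* vmemmap pages kept per HugeTLB page */
#define MODEL_RESERVE_SIZE      (MODEL_RESERVE_PAGES * MODEL_PAGE_SIZE)

/* Number of struct page entries that fit in the reserved area. */
#define MODEL_USABLE_STRUCT_PAGES (MODEL_RESERVE_SIZE / MODEL_STRUCT_PAGE_SIZE)

/* Hypothetical count of struct pages that carry HugeTLB metadata. */
#define MODEL_NR_USED_SUBPAGE   4u

/* Userspace analogue of BUILD_BUG_ON(): the build fails if the bound breaks. */
static_assert(MODEL_NR_USED_SUBPAGE <= MODEL_USABLE_STRUCT_PAGES,
              "metadata would be stored outside the reserved struct pages");

/* Returns 1 if struct page index idx lies inside the reserved area. */
static int model_tail_index_is_usable(size_t idx)
{
    return idx < MODEL_USABLE_STRUCT_PAGES;
}
```

      Under these assumptions the reserved area holds 128 struct page entries,
      so index 3 is usable while index 128 is not.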
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled · 4bab4964
      Muchun Song authored
      
      
      The memory_hotplug.memmap_on_memory parameter is not compatible with
      hugetlb_free_vmemmap, so disable it when hugetlb_free_vmemmap is enabled.
      
      [akpm@linux-foundation.org: remove unneeded include, per Oscar]
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap · e9fdff87
      Muchun Song authored
      
      
      Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
      freeing unused vmemmap pages associated with each hugetlb page on boot.
      
      We disable PMD mapping of vmemmap pages on x86-64 when this feature is
      enabled, because vmemmap_remap_free() depends on vmemmap being base page
      mapped.
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      Muchun Song authored
      
      
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3). virtio-mem will handle migration errors gracefully.
          CMA might be able to fall back on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page freeing context.  To keep
      those allocations from being disruptive (e.g. triggering the OOM killer),
      __GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
      a non-sleeping allocation would be too fragile and could fail too easily
      under memory pressure.  GFP_ATOMIC and other modes that access memory
      reserves are not used because we want to avoid consuming reserves under
      heavy hugetlb freeing.
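
      The refuse-to-free behavior described above can be modelled in userspace
      roughly as follows.  This is only a sketch under stated assumptions: the
      model_ names, the fixed array of eight vmemmap slots and the budget
      counter that simulates memory pressure are all hypothetical and are not
      the kernel's data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a HugeTLB page whose vmemmap was trimmed. */
struct model_hpage {
    void  *vmemmap[8];   /* restored vmemmap pages, NULL while freed */
    size_t nr_missing;   /* how many entries must be re-allocated */
    bool   in_pool;      /* still counted in the huge page pool */
};

/* Stand-in allocator; *budget simulates how much memory is available. */
static void *model_alloc_page(size_t *budget)
{
    if (*budget == 0)
        return NULL;     /* fail fast instead of retrying hard */
    (*budget)--;
    return malloc(4096);
}

/* Restore all missing vmemmap pages, or none.  Returns 0 or -1. */
static int model_restore_vmemmap(struct model_hpage *hp, size_t *budget)
{
    size_t i;

    for (i = 0; i < hp->nr_missing; i++) {
        hp->vmemmap[i] = model_alloc_page(budget);
        if (!hp->vmemmap[i])
            goto undo;
    }
    hp->nr_missing = 0;
    return 0;
undo:
    /* Give back everything allocated so far. */
    while (i-- > 0) {
        free(hp->vmemmap[i]);
        hp->vmemmap[i] = NULL;
        (*budget)++;
    }
    return -1;
}

/* Release the huge page only if its vmemmap could be restored. */
static int model_free_hpage(struct model_hpage *hp, size_t *budget)
{
    if (model_restore_vmemmap(hp, budget))
        return -1;       /* refuse: page stays in the pool, retry later */
    hp->in_pool = false;
    return 0;
}
```

      With nr_missing set to 6, a budget of 3 makes model_free_hpage() fail
      and leaves the page in the pool; a later call with a budget of 6
      succeeds.  The sketch never releases the restored pages, which a real
      caller would have to do.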
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: defer freeing of HugeTLB pages · b65d4adb
      Muchun Song authored
      
      
      In the subsequent patch, we will allocate the vmemmap pages when freeing
      a HugeTLB page.  But update_and_free_page() can be called in any context,
      so we cannot use GFP_KERNEL to allocate vmemmap pages.  However, we can
      defer the actual freeing to a kworker to avoid using GFP_ATOMIC to
      allocate the vmemmap pages.
      
      __update_and_free_page() is where the call to allocate vmemmap pages
      will be inserted.
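
      The deferral idea can be pictured with a small userspace model: callers
      that must not sleep push the page onto a lock-free list, and a worker
      later detaches the list and does the real freeing.  Everything here is a
      hypothetical sketch (the model_ names, the can_sleep flag and the C11
      atomics standing in for kernel primitives are assumptions), not the code
      added by this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_hpage {
    struct model_hpage *next;  /* link while queued for deferred freeing */
    bool freed;
};

/* Lock-free stack of pages waiting to be freed (hypothetical name). */
static _Atomic(struct model_hpage *) model_free_list;

/* The step that may sleep, e.g. because it allocates memory. */
static void model_do_free(struct model_hpage *hp)
{
    hp->freed = true;
}

/* Worker side: detach the whole list at once, then free each page. */
static size_t model_free_worker(void)
{
    struct model_hpage *hp =
        atomic_exchange(&model_free_list, (struct model_hpage *)NULL);
    size_t n = 0;

    while (hp) {
        struct model_hpage *next = hp->next;

        model_do_free(hp);
        hp = next;
        n++;
    }
    return n;
}

/* Caller side: free directly if sleeping is allowed, otherwise queue. */
static void model_update_and_free(struct model_hpage *hp, bool can_sleep)
{
    struct model_hpage *head;

    if (can_sleep) {
        model_do_free(hp);
        return;
    }
    head = atomic_load(&model_free_list);
    do {
        hp->next = head;
    } while (!atomic_compare_exchange_weak(&model_free_list, &head, hp));
    /* A real implementation would schedule the worker here. */
}
```

      In this model a page passed with can_sleep set to false stays unfreed
      until model_free_worker() runs, which mirrors how the slow path is kept
      out of contexts where sleeping is not allowed.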
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: free the vmemmap pages associated with each HugeTLB page · f41f2ed4
      Muchun Song authored
      
      
      Every HugeTLB page has more than one struct page structure.  We __know__
      that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
      store metadata associated with each HugeTLB page.
      
      There are a lot of struct page structures associated with each HugeTLB
      page.  For tail pages, the value of compound_head is the same.  So we can
      reuse the first page of tail page structures.  We map the virtual addresses
      of the remaining pages of tail page structures to the first tail page
      struct, and then free these page frames.  Therefore, we need to reserve two
      pages as vmemmap areas.
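
      To make the savings concrete, the following sketch works through the
      arithmetic for one HugeTLB page.  It assumes a 4 KiB base page and a
      64-byte struct page, which are typical for x86-64 but are not stated in
      this commit message; the MODEL_ and model_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE        4096u  /* assumed base page size */
#define MODEL_STRUCT_PAGE_SIZE 64u    /* assumed sizeof(struct page) */
#define MODEL_RESERVE_PAGES    2u     /* two vmemmap pages stay reserved */

/* vmemmap pages needed to hold the struct pages of one HugeTLB page. */
static size_t model_vmemmap_pages(size_t hugepage_bytes)
{
    size_t nr_struct_pages = hugepage_bytes / MODEL_PAGE_SIZE;
    size_t vmemmap_bytes = nr_struct_pages * MODEL_STRUCT_PAGE_SIZE;

    return vmemmap_bytes / MODEL_PAGE_SIZE;
}

/* vmemmap pages that could go back to the buddy allocator. */
static size_t model_free_vmemmap_pages(size_t hugepage_bytes)
{
    size_t total = model_vmemmap_pages(hugepage_bytes);

    return total > MODEL_RESERVE_PAGES ? total - MODEL_RESERVE_PAGES : 0;
}
```

      Under these assumptions a 2 MiB HugeTLB page is described by 8 vmemmap
      pages, of which 6 could be freed, and a 1 GiB HugeTLB page by 4096
      vmemmap pages, of which 4094 could be freed.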
      
      When we allocate a HugeTLB page from the buddy allocator, we can free some
      vmemmap pages associated with each HugeTLB page.  It is more appropriate
      to do it in prep_new_huge_page().
      
      free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
      associated with a HugeTLB page can be freed, returns zero for now, which
      means the feature is disabled.  We will enable it once all the
      infrastructure is there.
      
      [willy@infradead.org: fix documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: gather discrete indexes of tail page · cd39d4e9
      Muchun Song authored
      
      
      For a HugeTLB page, there is more metadata to save in the struct page.
      But the head struct page cannot meet our needs, so we have to abuse other
      tail struct pages to store the metadata.  To avoid conflicts caused by
      subsequent use of more tail struct pages, we can gather these discrete
      tail struct page indexes in one place.  This makes it easier to add a new
      tail page index later.
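
      One way to picture gathering the indexes is a single enum whose constants
      name the tail struct page used for each piece of metadata.  The sketch
      below is illustrative only: the MODEL_ names, the struct model_page type
      and the accessor functions are hypothetical and do not reproduce the
      kernel's definitions.

```c
#include <assert.h>

/*
 * Hypothetical index names: each constant says which tail struct page
 * holds one piece of metadata.  Index 0 is the head page.
 */
enum model_subpage_index {
    MODEL_SUBPAGE_INDEX_SUBPOOL = 1,  /* first tail page */
    MODEL_SUBPAGE_INDEX_CGROUP,       /* 2 */
    MODEL_SUBPAGE_INDEX_CGROUP_RSVD,  /* 3 */
    MODEL_NR_USED_SUBPAGE,            /* 4: one past the last index in use */
};

struct model_page {
    unsigned long private_data;       /* stand-in for a per-page field */
};

/* Store metadata in the tail page selected by a named index. */
static void model_set_meta(struct model_page *head,
                           enum model_subpage_index idx, unsigned long val)
{
    assert(idx > 0 && idx < MODEL_NR_USED_SUBPAGE);
    head[idx].private_data = val;
}

/* Read metadata back from the tail page selected by a named index. */
static unsigned long model_get_meta(const struct model_page *head,
                                    enum model_subpage_index idx)
{
    assert(idx > 0 && idx < MODEL_NR_USED_SUBPAGE);
    return head[idx].private_data;
}
```

      Adding a new kind of metadata then means adding one constant before
      MODEL_NR_USED_SUBPAGE, and the final constant keeps counting the struct
      pages in use without further edits.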
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>