  1. Jul 01, 2021
      mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
      Christophe Leroy authored
      
      
      Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      At the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
      Here the PMD level is 4M; it doesn't correspond to any supported
      page size.
      
      For now, implement the use of 16k and 512k pages, which is done
      at PTE level.
      
      Support for 8M pages will be implemented later; it requires the use
      of hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
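
      As a sketch (the exact generic signatures are an assumption based on
      the description above), the fallback stubs could look like:

        #ifndef arch_vmap_pte_range_map_size
        static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
        		unsigned long end, u64 pfn, unsigned int max_page_shift)
        {
        	return PAGE_SIZE;	/* no huge vmap support: map one base page */
        }
        #endif

        #ifndef arch_vmap_pte_supported_shift
        static inline int arch_vmap_pte_supported_shift(unsigned long size)
        {
        	return PAGE_SHIFT;	/* fall back to base pages for any size */
        }
        #endif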
      
      This patch (of 5):
      
      At the time being, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
      vma is used to get the page shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.
      
      In order to use this function without a vma, replace vma by shift and
      flags.  Also remove the unused parameters.
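
      With that change, the prototype becomes:

        pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);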
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/huge_memory.c: don't discard hugepage if other processes are mapping it · babbbdd0
      Miaohe Lin authored
      If other processes are mapping any other subpages of the hugepage, i.e.
      in the pte-mapped THP case, page_mapcount() will return 1 incorrectly.
      Then we would discard the page while other processes are still mapping
      it.  Fix it by using total_mapcount(), which can tell whether other
      processes are still mapping it.
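
      A minimal sketch of the resulting check (the surrounding MADV_FREE
      context in madvise_free_huge_pmd() is assumed):

        /*
         * page_mapcount() only sees the PMD mapping and misses pte-mapped
         * subpages, so use total_mapcount() to detect other sharers.
         */
        if (total_mapcount(page) != 1)
        	goto out;	/* someone else still maps it: don't discard */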
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
      Fixes: b8d3c4c3 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd · 9132a468
      Miaohe Lin authored
      Commit aa88b68c ("thp: keep huge zero page pinned until tlb flush")
      introduced tlb_remove_page() for the huge zero page to keep it pinned
      until the flush is complete and to prevent the page from being split
      underneath us.  But the huge zero page is kept pinned until all relevant
      mm_users reach zero since commit 6fcb52a5 ("thp: reduce usage of huge
      zero page's atomic counter").  So tlb_remove_page_size() for the huge
      zero pmd is unnecessary now.
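
      A sketch of the simplified zap path (the context in zap_huge_pmd() is
      assumed):

        } else if (is_huge_zero_pmd(orig_pmd)) {
        	zap_deposited_table(tlb->mm, pmd);
        	spin_unlock(ptl);
        	/* No tlb_remove_page_size() here: the huge zero page is
        	 * already pinned until all relevant mm_users reach zero. */
        }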
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() · e6be37b2
      Miaohe Lin authored
      Since commit 99cb0dbd ("mm,thp: add read-only THP support for
      (non-shmem) FS"), read-only THP file mapping is supported.  But the
      corresponding check was never added to transparent_hugepage_enabled().
      To fix it, add a check for read-only THP file mappings, and also
      introduce the helper transhuge_vma_enabled() to check whether THP is
      enabled for a specified vma, to reduce duplicated code.  Rename
      transparent_hugepage_enabled to transparent_hugepage_active to make the
      code easier to follow, as suggested by David Hildenbrand.
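
      A sketch of what the new helper could look like (close to what the
      description implies; the exact body is an assumption):

        static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
        					 unsigned long vm_flags)
        {
        	/* Explicitly disabled through madvise or prctl. */
        	if ((vm_flags & VM_NOHUGEPAGE) ||
        	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
        		return false;
        	return true;
        }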
      
      [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
        Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/huge_memory.c: use page->deferred_list · dfe5c51c
      Miaohe Lin authored
      
      
      Now that we can represent the location of ->deferred_list instead of
      ->mapping + ->index, make use of it to improve readability.
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK · b2bd53f1
      Miaohe Lin authored
      
      
      Patch series "Cleanup and fixup for huge_memory:, v3.
      
      This series contains cleanups to remove dedicated macro and remove
      unnecessary tlb_remove_page_size() for huge zero pmd.  Also this adds
      missing read-only THP checking for transparent_hugepage_enabled() and
      avoids discarding hugepage if other processes are mapping it.  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      Rewrite the pgoff checking logic to remove the macro
      HPAGE_CACHE_INDEX_MASK, which is only used here, to simplify the code.
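
      The rewritten check might look like this (a sketch, not the exact
      diff):

        /* before: */
        if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
            (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
        	return false;

        /* after: the same alignment test, without the dedicated macro */
        if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
        		HPAGE_PMD_NR))
        	return false;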
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210511134857.1581273-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake · b593b90d
      Shixin Liu authored
      
      
      Remove the redundant pfn_{pmd/pte}() in {pmd/pte}_advanced_tests() and
      adjust pfn_pud() in pud_advanced_tests() to make it similar to the
      other two functions.
      
      In addition, the branch condition should be CONFIG_TRANSPARENT_HUGEPAGE
      instead of CONFIG_ARCH_HAS_PTE_DEVMAP.
      
      Link: https://lkml.kernel.org/r/20210419071820.750217-2-liushixin2@huawei.com
      Signed-off-by: Shixin Liu <liushixin2@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE · 5fe77be6
      Shixin Liu authored
      
      
      The functions {pmd/pud}_set_huge and {pmd/pud}_clear_huge are not
      dependent on THP.  Hence move {pmd/pud}_huge_tests out of
      CONFIG_TRANSPARENT_HUGEPAGE.
      
      Link: https://lkml.kernel.org/r/20210419071820.750217-1-liushixin2@huawei.com
      Signed-off-by: Shixin Liu <liushixin2@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate · 77490587
      Muchun Song authored
      
      
      All the infrastructure is ready, so introduce the nr_free_vmemmap_pages
      field in the hstate to indicate how many vmemmap pages associated with
      a HugeTLB page can be freed to the buddy allocator, and initialize it
      in hugetlb_vmemmap_init().  This patch is the actual enablement of the
      feature.

      There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page
      structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is
      enabled, so add a BUILD_BUG_ON to catch invalid usage of the tail
      struct page.
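
      The guard could be as simple as (a sketch; __NR_USED_SUBPAGE and
      RESERVE_VMEMMAP_SIZE are the names used by this series):

        /* Tail struct pages beyond the reserved vmemmap area are remapped
         * away, so using them would be a bug. */
        BUILD_BUG_ON(__NR_USED_SUBPAGE >= RESERVE_VMEMMAP_SIZE / sizeof(struct page));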
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled · 4bab4964
      Muchun Song authored
      
      
      The memory_hotplug.memmap_on_memory parameter is not compatible with
      hugetlb_free_vmemmap, so disable it when hugetlb_free_vmemmap is enabled.
      
      [akpm@linux-foundation.org: remove unneeded include, per Oscar]
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap · e9fdff87
      Muchun Song authored
      
      
      Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
      freeing unused vmemmap pages associated with each HugeTLB page on boot.

      We disable PMD mapping of vmemmap pages for the x86-64 arch when this
      feature is enabled, because vmemmap_remap_free() depends on the vmemmap
      being base-page mapped.
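
      For example, booting with the feature enabled:

        hugetlb_free_vmemmap=on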
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      Muchun Song authored
      
      
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3). virtio-mem will handle migration errors gracefully.
          CMA might be able to fallback on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page-freeing context.  In order
      for those allocations not to be disruptive (e.g. not to trigger the
      OOM killer), __GFP_NORETRY is used.  hugetlb_lock is dropped for the
      allocation because a non-sleeping allocation would be too fragile and
      could fail too easily under memory pressure.  GFP_ATOMIC or other
      modes that access memory reserves are not used because we want to
      prevent consuming reserves under heavy hugetlb freeing.
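
      A sketch of the allocation mode this describes (the exact flag
      combination in the code is an assumption beyond __GFP_NORETRY):

        /* May sleep (hugetlb_lock is dropped), but gives up early instead
         * of retrying hard or invoking the OOM killer. */
        gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY;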
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: defer freeing of HugeTLB pages · b65d4adb
      Muchun Song authored
      
      
      In a subsequent patch, we will need to allocate the vmemmap pages when
      freeing a HugeTLB page.  But update_and_free_page() can be called from
      any context, so we cannot use GFP_KERNEL to allocate the vmemmap pages
      there.  However, we can defer the actual freeing to a kworker to avoid
      having to use GFP_ATOMIC to allocate the vmemmap pages.

      __update_and_free_page() is where the call to allocate vmemmap pages
      will be inserted.
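
      A sketch of the deferral mechanism (names assumed from the
      description):

        static LLIST_HEAD(hpage_freelist);

        static void free_hpage_workfn(struct work_struct *work)
        {
        	struct llist_node *node = llist_del_all(&hpage_freelist);

        	while (node) {
        		struct page *page = container_of((struct address_space **)node,
        						 struct page, mapping);

        		node = node->next;
        		page->mapping = NULL;
        		/* Process context: vmemmap allocation may use GFP_KERNEL. */
        		__update_and_free_page(page_hstate(page), page);
        	}
        }
        static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

        static void update_and_free_page(struct hstate *h, struct page *page)
        {
        	/* Reuse page->mapping as an llist_node; the page is unused now. */
        	if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
        		schedule_work(&free_hpage_work);
        }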
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: free the vmemmap pages associated with each HugeTLB page · f41f2ed4
      Muchun Song authored
      
      
      Every HugeTLB page has more than one struct page structure.  We
      __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page
      structures to store metadata associated with each HugeTLB page.

      There are a lot of struct page structures associated with each HugeTLB
      page.  For tail pages, the value of compound_head is the same.  So we
      can reuse the first page of the tail page structures.  We map the
      virtual addresses of the remaining pages of tail page structures to
      the first tail page struct and then free those page frames.
      Therefore, we only need to reserve two pages as vmemmap areas.

      When we allocate a HugeTLB page from the buddy allocator, we can free
      some vmemmap pages associated with each HugeTLB page.  It is more
      appropriate to do it in prep_new_huge_page().
      
      The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
      associated with a HugeTLB page can be freed, returns zero for now, which
      means the feature is disabled.  We will enable it once all the
      infrastructure is there.
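
      As a sketch, the gate looks roughly like:

        /*
         * How many vmemmap pages can be freed per HugeTLB page of this
         * hstate.  Returning 0 keeps the feature disabled until the
         * remaining infrastructure lands.
         */
        static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
        {
        	return 0;
        }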
      
      [willy@infradead.org: fix documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: gather discrete indexes of tail page · cd39d4e9
      Muchun Song authored
      
      
      For a HugeTLB page, there is more metadata to save in the struct page.
      But the head struct page cannot meet our needs, so we have to abuse
      other tail struct pages to store the metadata.  In order to avoid
      conflicts caused by subsequent use of more tail struct pages, we can
      gather these discrete indexes of the tail struct pages.  That way it
      will be easier to add a new tail page index later.
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP · 6be24bed
      Muchun Song authored
      
      
      The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some
      vmemmap pages associated with pre-allocated HugeTLB pages.  For
      example, on x86_64, 6 vmemmap pages of 4KB each can be saved for each
      2MB HugeTLB page, and 4094 vmemmap pages of 4KB each can be saved for
      each 1GB HugeTLB page.
      
      When a HugeTLB page is allocated or freed, the vmemmap array representing
      the range associated with the page will need to be remapped.  When a page
      is allocated, vmemmap pages are freed after remapping.  When a page is
      freed, previously discarded vmemmap pages must be allocated before
      remapping.
      
      The config option is introduced early so that supporting code can be
      written to depend on the option.  The initial version of the code only
      provides support for x86-64.
      
      If config HAVE_BOOTMEM_INFO_NODE is enabled, the vmemmap-freeing code
      depends on it to free vmemmap pages.  Otherwise, just use
      free_reserved_page() to free the vmemmap pages.  The routine
      register_page_bootmem_info() is used to register bootmem info.
      Therefore, make sure register_page_bootmem_info is enabled if
      HUGETLB_PAGE_FREE_VMEMMAP is defined.
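
      A sketch of that fallback (the helper name is an assumption;
      free_bootmem_page comes from the bootmem info code):

        static inline void free_vmemmap_page(struct page *page)
        {
        	if (PageReserved(page))
        		free_bootmem_page(page);	/* bootmem-backed vmemmap */
        	else
        		__free_page(page);		/* ordinary buddy page */
        }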
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c · 426e5c42
      Muchun Song authored
      
      
      Patch series "Free some vmemmap pages of HugeTLB page", v23.
      
      This patch series frees some vmemmap pages (struct page structures)
      associated with each HugeTLB page when preallocated, to save memory.

      In order to reduce the difficulty of code review for the first
      version, we disable PMD/huge page mapping of the vmemmap when this
      feature is enabled.  This eliminates a bunch of complex page table
      manipulation code.  Once this patch series is solid, we can add the
      vmemmap page table manipulation code in the future.
      
      The struct page structures (page structs) are used to describe a
      physical page frame.  By default, there is a one-to-one mapping from a
      page frame to its corresponding page struct.
      
      HugeTLB pages consist of multiple base page size pages and are
      supported by many architectures.  See hugetlbpage.rst in the
      Documentation directory for more details.  On the x86 architecture,
      HugeTLB pages of size 2MB and 1GB are currently supported.  Since the
      base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base
      pages and a 1GB HugeTLB page consists of 4096 base pages.  For each
      base page, there is a corresponding page struct.
      
      Within the HugeTLB subsystem, only the first 4 page structs are used to
      contain unique information about a HugeTLB page.  HUGETLB_CGROUP_MIN_ORDER
      provides this upper limit.  The only 'useful' information in the remaining
      page structs is the compound_head field, and this field is the same for
      all tail pages.
      
      By removing redundant page structs for HugeTLB pages, memory can be
      returned to the buddy allocator for other uses.
      
      When the system boots up, every 2MB HugeTLB page has 512 struct page
      structs, which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | -------------> |     2     |
       |           |                     +-----------+                +-----------+
       |           |                     |     3     | -------------> |     3     |
       |           |                     +-----------+                +-----------+
       |           |                     |     4     | -------------> |     4     |
       |    2MB    |                     +-----------+                +-----------+
       |           |                     |     5     | -------------> |     5     |
       |           |                     +-----------+                +-----------+
       |           |                     |     6     | -------------> |     6     |
       |           |                     +-----------+                +-----------+
       |           |                     |     7     | -------------> |     7     |
       |           |                     +-----------+                +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      The value of page->compound_head is the same for all tail pages.  The
      first page of page structs (page 0) associated with the HugeTLB page
      contains the 4 page structs necessary to describe the HugeTLB.  The only
      use of the remaining pages of page structs (page 1 to page 7) is to point
      to page->compound_head.  Therefore, we can remap pages 2 to 7 to page 1.
      Only 2 pages of page structs will be used for each HugeTLB page.  This
      will allow us to free the remaining 6 pages to the buddy allocator.
      
      Here is how things look after remapping.
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      When a HugeTLB is freed to the buddy system, we should allocate 6 pages
      for vmemmap pages and restore the previous mapping relationship.
      
      Apart from 2MB HugeTLB page, we also have 1GB HugeTLB page.  It is similar
      to the 2MB HugeTLB page.  We also can use this approach to free the
      vmemmap pages.
      
      In this case, for a 1GB HugeTLB page, we can save 4094 pages.  This is
      a very substantial gain.  On our servers, we run SPDK/QEMU
      applications that use 1024GB of HugeTLB pages.  With this feature
      enabled, we can save ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of
      memory.
      
      Because the vmemmap page tables are reconstructed on the
      freeing/allocating path, some overhead is added.  Here is some
      overhead analysis.
      
      1) Allocating 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.166s
         user     0m0.000s
         sys      0m0.166s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
         [32K, 64K)             4 |                                                    |
      
         b) Without this patch series:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.067s
         user     0m0.000s
         sys      0m0.067s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)             93 |                                                    |
      
         Summary: allocation with this feature is about ~2x slower than before.
      
      2) Freeing 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.213s
         user     0m0.000s
         sys      0m0.213s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)              6 |                                                    |
         [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [32K, 64K)             7 |                                                    |
      
         b) Without this patch series:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.081s
         user     0m0.000s
         sys      0m0.081s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
         [16K, 32K)             8 |                                                    |
      
         Summary: the overhead of __free_hugepage makes freeing about ~2-3x slower than before.
      
      Although the overhead has increased, the overhead is not significant.
      Like Mike said, "However, remember that the majority of use cases create
      HugeTLB pages at or shortly after boot time and add them to the pool.  So,
      additional overhead is at pool creation time.  There is no change to
      'normal run time' operations of getting a page from or returning a page to
      the pool (think page fault/unmap)".
      
      Despite the overhead, and in addition to the memory gains from this
      series, the following data was obtained by Joao Martins.  Many thanks
      for his effort.

      There's an additional benefit: page (un)pinners will see an
      improvement, which Joao presumes is because there are fewer memmap
      pages and thus the tail/head pages stay in cache more often.
      
      Out of the box Joao saw (when comparing linux-next against linux-next +
      this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):
      
      	get_user_pages(): ~32k -> ~9k
      	unpin_user_pages(): ~75k -> ~70k
      
      Usually, any tight loop fetching compound_head(), or reading tail page
      data (e.g. compound_head), benefits a lot.  There were some unpinning
      inefficiencies Joao was fixing [2], and with those fixes added it
      shows even more:
      
      	unpin_user_pages(): ~27k -> ~3.8k
      
      [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/
      
      This patch (of 9):
      
      Move the common bootmem info registration API to its own
      bootmem_info.c.  We will use {get,put}_page_bootmem() in a later patch
      to initialize the page for the vmemmap pages, or to free the vmemmap
      pages to the buddy allocator.  So move them out of
      CONFIG_MEMORY_HOTPLUG_SPARSE.  This is just code movement without any
      functional change.
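
      For reference, the moved registration helpers keep their existing
      signatures (a sketch of the declarations):

        void get_page_bootmem(unsigned long info, struct page *page,
        		      unsigned long type);
        void put_page_bootmem(struct page *page);
        void register_page_bootmem_info_node(struct pglist_data *pgdat);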
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Jun 30, 2021
      mm,hwpoison: make get_hwpoison_page() call get_any_page() · 0ed950d1
      Naoya Horiguchi authored
      
      
      __get_hwpoison_page() could fail to grab a refcount due to some race
      condition, so it's helpful if we can handle it by retrying.  We
      already have retry logic, so make get_hwpoison_page() call
      get_any_page() when called from memory_failure().

      As a result, get_hwpoison_page() can return negative values (i.e.
      error codes), so some callers are also changed to handle error cases.
      soft_offline_page() does nothing for -EBUSY because that's enough, and
      users in userspace can easily handle it.  unpoison_memory() is also
      unchanged because it's broken and needs thorough fixes (to be done
      later).
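
      A sketch of the caller-side handling this implies (the return-value
      convention is assumed from the description):

        ret = get_hwpoison_page(p, flags);	/* 1: got refcount, 0: none, <0: error */
        if (ret < 0)
        	return ret;	/* e.g. -EBUSY, propagated to the caller */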
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm,hwpoison: send SIGBUS with error virtual address · a3f5d80e
      Naoya Horiguchi authored
      
      
      Now an action-required MCE on an already hwpoisoned address surely
      sends a SIGBUS to the current process, but the SIGBUS doesn't convey
      the error virtual address.  That's not optimal for hwpoison-aware
      applications.

      To fix the issue, make memory_failure() call kill_accessing_process(),
      which does a pagetable walk to find the error virtual address.  It
      could find multiple virtual addresses for the same error page, and it
      seems hard to tell which virtual address is the correct one.  But
      that's rare, and sending an incorrect virtual address could be better
      than no address.  So let's report the first found virtual address for
      now.
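
      A sketch of the pagetable walk described above (the ops and priv names
      are assumptions):

        mmap_read_lock(p->mm);
        ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops, &priv);
        mmap_read_unlock(p->mm);
        if (ret == 1 && priv.tk.addr)
        	/* Found a mapping of the poisoned pfn: SIGBUS with that address. */
        	kill_proc(&priv.tk, pfn, flags);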
      
      [naoya.horiguchi@nec.com: fix walk_page_range() return]
        Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
      
      Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes · 203c06ee
      Mel Gorman authored
      
      
      Dave Hansen reported the following about Feng Tang's tests on a machine
      with persistent memory onlined as a DRAM-like device.
      
        Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
        ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
        use" mode and being managed via the buddy just like the normal RAM.
      
        The PMEM zones are big ones:
      
              present  65011712 = 248 G
              high       134595 = 525 M
      
        The PMEM nodes, of course, don't have any CPUs in them.
      
        With your series, the pcp->high value per-cpu is 69584 pages or about
        270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
        worst-case memory in the pcps per zone, or roughly 10% of the size of
        the zone.
      
      This should not cause a problem as such although it could trigger reclaim
      due to pages being stored on per-cpu lists for CPUs remote to a node.  It
      is not possible to treat cpuless nodes exactly the same as normal nodes
      but the worst-case scenario can be mitigated by splitting pcp->high across
      all online CPUs for cpuless memory nodes.
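
      A sketch of the mitigation (field and variable names assumed):

        unsigned int nr_local_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));

        if (!nr_local_cpus)	/* cpuless node, e.g. PMEM: split across all CPUs */
        	high = total_pages / num_online_cpus();
        else
        	high = total_pages / nr_local_cpus;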
      
      Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Tang, Feng" <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/page_alloc: allow high-order pages to be stored on the per-cpu lists · 44042b44
      Mel Gorman authored
      
      
      The per-cpu page allocator (PCP) only stores order-0 pages.  This means
      that all THP and "cheap" high-order allocations, including SLUB's,
      contend on the zone->lock.  This patch extends the PCP allocator to
      store THP and "cheap" high-order pages.  Note that struct per_cpu_pages
      increases in size to 256 bytes (4 cache lines) on x86-64.
      
      Note that this is not necessarily a universal performance win because of
      how it is implemented.  High-order pages can cause pcp->high to be
      exceeded prematurely for lower-orders so for example, a large number of
      THP pages being freed could release order-0 pages from the PCP lists.
      Hence, much depends on the allocation/free pattern as observed by a single
      CPU to determine if caching helps or hurts a particular workload.
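
      A sketch of how the per-order lists can be indexed (a simplified
      version; the real code special-cases the THP order):

        static inline unsigned int order_to_pindex(int migratetype, int order)
        {
        	/* One list per (migratetype, order) pair on each CPU. */
        	return (MIGRATE_PCPTYPES * order) + migratetype;
        }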
      
      That said, basic performance testing passed.  The following is a netperf
      UDP_STREAM test which hits the relevant patches as some of the network
      allocations are high-order.
      
      netperf-udp
                                       5.13.0-rc2             5.13.0-rc2
                                 mm-pcpburst-v3r4   mm-pcphighorder-v1r7
      Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
      Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
      Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
      Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
      Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
      Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
      Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
      Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
      Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*
      
      Functionally, a patch like this is necessary to make bulk allocation of
      high-order pages work with similar performance to order-0 bulk
      allocations.  The bulk allocator is not updated in this series, as
      its users would first have to determine how they want to track the
      order of the pages it allocates.
      
      Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44042b44
    • Mike Rapoport's avatar
      mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM · 43b02ba9
      Mike Rapoport authored
      
      
      After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
      configuration option is equivalent to FLATMEM.
      
      Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
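
      The conversion is mechanical wherever the old symbol guards code; a
      guard of roughly this shape (illustrative, not a particular kernel
      file) simply swaps the symbol:

        /* Before (illustrative): the flat memory map hangs off the node. */
        #ifdef CONFIG_FLAT_NODE_MEM_MAP
        struct page *node_mem_map;
        #endif

        /* After: the equivalent FLATMEM symbol guards the same field. */
        #ifdef CONFIG_FLATMEM
        struct page *node_mem_map;
        #endif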
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b02ba9
    • Mike Rapoport's avatar
      mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA · a9ee6cf5
      Mike Rapoport authored
      
      
      After removal of DISCONTIGMEM, the NEED_MULTIPLE_NODES and NUMA
      configuration options are equivalent.
      
      Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
      
      Done with
      
      	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
      		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
      	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
      		$(git grep -wl NEED_MULTIPLE_NODES)
      
      with manual tweaks afterwards.
      
      [rppt@linux.ibm.com: fix arm boot crash]
        Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ee6cf5
    • Mike Rapoport's avatar
      docs: remove description of DISCONTIGMEM · 48d9f335
      Mike Rapoport authored
      
      
      Remove the description of DISCONTIGMEM from the "Memory Models"
      document and update the VM sysctl description so that it won't
      mention DISCONTIGMEM.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-8-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48d9f335
    • Mike Rapoport's avatar
      arch, mm: remove stale mentions of DISCONTIGMEM · d3c251ab
      Mike Rapoport authored
      
      
      There are several places that mention DISCONTIGMEM in comments or have
      stale code guarded by CONFIG_DISCONTIGMEM.
      
      Remove the dead code and update the comments.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3c251ab
    • Mike Rapoport's avatar
      mm: remove CONFIG_DISCONTIGMEM · bb1c50d3
      Mike Rapoport authored
      
      
      There are no architectures that support DISCONTIGMEM left.
      
      Remove the configuration option and the dead code it was guarding in the
      generic memory management code.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1c50d3
    • Mike Rapoport's avatar
      m68k: remove support for DISCONTIGMEM · 5ab06e10
      Mike Rapoport authored
      
      
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ab06e10
    • Mike Rapoport's avatar
      arc: remove support for DISCONTIGMEM · 8b793b44
      Mike Rapoport authored
      
      
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b793b44
    • Mike Rapoport's avatar
      arc: update comment about HIGHMEM implementation · e7793e53
      Mike Rapoport authored
      
      
      Arc does not use DISCONTIGMEM to implement high memory; update the
      comment describing how high memory works to reflect this.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7793e53
    • Mike Rapoport's avatar
      alpha: remove DISCONTIGMEM and NUMA · fdb7d9b7
      Mike Rapoport authored
      
      
      Patch series "Remove DISCONTIGMEM memory model", v3.
      
      The SPARSEMEM memory model was supposed to entirely replace
      DISCONTIGMEM a (long) while ago.  The last architectures that used
      DISCONTIGMEM were updated to use other memory models in v5.11, and
      it is about time to remove DISCONTIGMEM from the kernel entirely.
      
      This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
      model selection in mm/Kconfig and replaces usage of redundant
      CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
      and CONFIG_FLATMEM respectively.
      
      I've also removed NUMA support on alpha, which had been BROKEN for
      more than 15 years.
      
      There were also minor updates all over arch/ to remove mentions of
      DISCONTIGMEM in comments and #ifdefs.
      
      This patch (of 9):
      
      NUMA has been marked broken on alpha for more than 15 years, and
      DISCONTIGMEM was replaced with SPARSEMEM in v5.11.
      
      Remove both NUMA and DISCONTIGMEM support from alpha.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210608091316.3622-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdb7d9b7
    • Mel Gorman's avatar
      mm/page_alloc: move free_the_page · 21d02f8f
      Mel Gorman authored
      
      
      Patch series "Allow high order pages to be stored on PCP", v2.
      
      The per-cpu page allocator (PCP) only handles order-0 pages.  With the
      series "Use local_lock for pcp protection and reduce stat overhead" and
      "Calculate pcp->high based on zone sizes and active CPUs", it's now
      feasible to store high-order pages on PCP lists.
      
      This small series allows PCP to store "cheap" orders where cheap is
      determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.
      
      This patch (of 2):
      
      In the next patch, free_compound_page is going to use the common helper
      free_the_page.  This patch moves the definition to ease review.  No
      functional change.
      
      Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21d02f8f
    • Liu Shixin's avatar
      mm/page_alloc: fix counting of managed_pages · f7ec1044
      Liu Shixin authored
      Commit f6366156 ("mm/page_alloc.c: clear out zone->lowmem_reserve[]
      if the zone is empty") clears out zone->lowmem_reserve[] if the zone
      is empty.  But when the zone is not empty and
      sysctl_lowmem_reserve_ratio[i] is set to zero,
      zone_managed_pages(zone) is not counted in managed_pages either.
      This is inconsistent with the description of lowmem_reserve, so fix
      it.
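
      The essence of the fix is to accumulate the upper zone's managed
      pages unconditionally and only then decide whether this reserve
      entry is cleared; a condensed sketch of that shape (not the verbatim
      kernel function):

        /*
         * Condensed sketch, illustrative only: the managed page total
         * must grow for every upper zone, even when this zone's ratio is
         * zero and its reserve entry is cleared, so later iterations see
         * a consistent count.
         */
        for (j = i + 1; j < MAX_NR_ZONES; j++) {
                struct zone *upper = &pgdat->node_zones[j];
                bool empty = !zone_managed_pages(upper);
                long ratio = sysctl_lowmem_reserve_ratio[i];

                managed_pages += zone_managed_pages(upper); /* always counted */

                if (ratio < 1 || empty)
                        zone->lowmem_reserve[j] = 0;
                else
                        zone->lowmem_reserve[j] = managed_pages / ratio;
        }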
      
      Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com
      Fixes: f6366156
      
       ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reported-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ec1044
    • Dong Aisheng's avatar
      mm/page_alloc: improve memmap_pages dbg msg · e47aa905
      Dong Aisheng authored
      
      
      Make debug message more accurate.
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-6-aisheng.dong@nxp.com
      Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e47aa905
    • Dong Aisheng's avatar
      mm: drop SECTION_SHIFT in code comments · 777c00f5
      Dong Aisheng authored
      Actually SECTIONS_SHIFT is what is used in the kernel code, so the
      code comment is strictly incorrect.  And since commit bbeae5b0 ("mm:
      move page flags layout to separate header"), the SECTIONS_SHIFT
      definition has been moved to include/linux/page-flags-layout.h.
      Since the code itself is quite straightforward, simply remove the
      comment instead of moving it to the new place as well.
      
      This also fixes a checkpatch complaint derived from the original code:
      WARNING: please, no space before tabs
      + * SECTIONS_SHIFT    ^I^I#bits space required to store a section #$
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-2-aisheng.dong@nxp.com
      Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
      Suggested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c00f5
    • Mel Gorman's avatar
      mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822
      Mel Gorman authored
      
      
      This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
      similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
      both pcp->batch and pcp->high with the higher pcp->high potentially
      reducing zone->lock contention.  However, the higher pcp->batch value also
      potentially increased allocation latency while the PCP was refilled.  This
      sysctl only adjusts pcp->high so that zone->lock contention is potentially
      reduced but allocation latency during a PCP refill remains the same.
      
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  35071
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=64
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  4383
                    batch: 63

        # sysctl vm.percpu_pagelist_high_fraction=0
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
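
      The values above are consistent with a sizing of roughly "managed
      zone pages divided by the fraction, split across the zone's local
      CPUs" (note 35071 * 8 ~= 4383 * 64).  A sketch of that calculation,
      where the fraction=0 fallback helper is hypothetical:

        /*
         * Illustrative sketch of turning the sysctl into a per-cpu high
         * value; not the verbatim kernel code.  A fraction of 0 restores
         * the default watermark-based sizing.
         */
        static unsigned long pcp_high_from_fraction(struct zone *zone,
                                                    int fraction,
                                                    unsigned int local_cpus)
        {
                if (fraction <= 0)
                        return pcp_default_high(zone);  /* hypothetical */

                return zone_managed_pages(zone) / fraction /
                       max(local_cpus, 1U);
        }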
      
      [mgorman@techsingularity.net: fix documentation]
        Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74f44822
    • Mel Gorman's avatar
      mm/page_alloc: limit the number of pages on PCP lists when reclaim is active · c49c2c47
      Mel Gorman authored
      
      
      When kswapd is active, direct reclaim is potentially active as well.  In
      either case, it is possible that a zone would be balanced if pages were
      not trapped on PCP lists.  Instead of draining remote pages, simply limit
      the size of the PCP lists while kswapd is active.
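
      In effect, the limit clamps the high watermark to a small multiple
      of the batch size while reclaim is flagged on the zone;
      approximately (a sketch, with the multiple chosen for illustration):

        /*
         * Sketch of the clamp, not the exact kernel code: while reclaim
         * is active on the zone, cap the effective pcp->high at a few
         * batches so freed pages return to the buddy allocator quickly.
         */
        static int pcp_effective_high(struct zone *zone, int high, int batch)
        {
                if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
                        return min(high, batch << 2);   /* 4 batches */

                return high;
        }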
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49c2c47
    • Mel Gorman's avatar
      mm/page_alloc: scale the number of pages that are batch freed · 3b12e7e9
      Mel Gorman authored
      
      
      When a task is freeing a large number of order-0 pages, it may
      acquire the zone->lock multiple times, freeing pages in batches.
      This may unnecessarily contend on the zone lock when freeing a very
      large number of pages.  This patch adapts the batch size based on
      the recent freeing pattern, scaling it up for subsequent frees.

      As the machines I used to test this are not large enough to
      illustrate a problem, a debugging patch shows patterns like the
      following (slightly edited for clarity)
      
      Baseline vanilla kernel
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
      
      With patches
        time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
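
      The doubling visible above (63, 126, 252, 504, then capped near the
      list size) suggests a shift-based free factor that grows while frees
      keep arriving; a hedged sketch with hypothetical names:

        /*
         * Illustrative sketch of batch scaling; the free_factor argument
         * and the capping policy are made up for this example.  Each
         * consecutive burst of frees doubles the effective batch, bounded
         * by the pages actually on the list.
         */
        static int scaled_free_batch(int batch, unsigned int free_factor,
                                     int count)
        {
                int scaled = min(count, batch << free_factor);

                return max(scaled, batch);      /* at least one plain batch */
        }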
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b12e7e9
    • Mel Gorman's avatar
      mm/page_alloc: adjust pcp->high after CPU hotplug events · 04f8cfea
      Mel Gorman authored
      
      
      The PCP high watermark is based on the number of online CPUs so the
      watermarks must be adjusted during CPU hotplug.  At the time of
      hot-remove, the number of online CPUs is already adjusted but during
      hot-add, a delta needs to be applied to update PCP to the correct value.
      After this patch is applied, the high watermarks are adjusted correctly.
      
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
        # echo 0 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  664
        # echo 1 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
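
      One plausible shape for the hotplug hook (function names are
      hypothetical, not the exact callbacks) is to recompute each
      populated zone's pcp->high whenever the online CPU count changes:

        /*
         * Sketch only: on CPU online/offline, re-derive pcp->high for
         * every populated zone so the total number of pages parked on
         * PCP lists keeps tracking num_online_cpus().  The recompute
         * helper is hypothetical.
         */
        static int pcp_cpu_hotplug_update(unsigned int cpu)
        {
                struct zone *zone;

                for_each_populated_zone(zone)
                        zone_recalc_pcp_high_and_batch(zone);  /* hypothetical */

                return 0;
        }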
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f8cfea
    • Mel Gorman's avatar
      mm/page_alloc: disassociate the pcp->high from pcp->batch · b92ca18e
      Mel Gorman authored
      
      
      The pcp high watermark is based on the batch size, but there is no
      real relationship between them other than that it is convenient to
      use early in boot.
      
      This patch takes the first step and bases pcp->high on the zone low
      watermark split across the number of CPUs local to a zone while the batch
      size remains the same to avoid increasing allocation latencies.  The
      intent behind the default pcp->high is "set the number of PCP pages such
      that if they are all full that background reclaim is not started
      prematurely".
      
      Note that in this patch the pcp->high values are adjusted after memory
      hotplug events, min_free_kbytes adjustments and watermark scale factor
      adjustments but not CPU hotplug events which is handled later in the
      series.
      
      On a test KVM instance;
      
      Before grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  378
                    batch: 63
      
      After grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
      [mgorman@techsingularity.net:  fix __setup_per_zone_wmarks for parallel memory
      hotplug]
        Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b92ca18e
    • Mel Gorman's avatar
      mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35
      Mel Gorman authored
      
      
      Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
      
      The per-cpu page allocator (PCP) is meant to reduce contention on the zone
      lock, but the sizing of batch and high is archaic and takes neither
      the zone size nor the number of CPUs local to a zone into account.
      With larger zones and more CPUs per node, the contention is getting
      worse.
      Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
      and high values means that the sysctl can reduce zone lock contention but
      also increase allocation latencies.
      
      This series disassociates pcp->high from pcp->batch and then scales
      pcp->high based on the size of the local zone with limited impact to
      reclaim and accounting for active CPUs but leaves pcp->batch static.  It
      also adapts the number of pages that can be on the pcp list based on
      recent freeing patterns.
      
      The motivation is partially to adjust to larger memory sizes but is
      also driven by the fact that large batches of page freeing via
      release_pages() often show zone contention as a major part of the
      problem.  Another driver is a bug report based on an older kernel
      where a multi-terabyte process could take several minutes to exit.
      A workaround was to use vm.percpu_pagelist_fraction to increase the
      pcp->high value, but testing indicated that a production workload
      could not use the same values because of an increase in allocation
      latencies.  Unfortunately, I cannot reproduce this test case myself
      as the multi-terabyte machines are in active use, but this series
      should alleviate the problem.
      
      The series aims to address both and partially acts as a
      prerequisite.  The PCP only works with order-0 pages, which is
      useless for SLUB (when using high orders) and THP (unconditionally).
      To store high-order pages on PCP, the pcp->high values need to be
      increased first.
      
      This patch (of 6):
      
      The vm.percpu_pagelist_fraction is used to increase the batch and high
      limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
      is to reduce zone lock acquisition when allocating/freeing pages but it
      has a problem.  While it can decrease contention, it can also increase
      latency on the allocation side due to unreasonably large batch sizes.
      This leads to games where an administrator adjusts
      percpu_pagelist_fraction on the fly to work around contention and
      allocation latency problems.
      
      This series aims to alleviate the problems with zone lock contention while
      avoiding the allocation-side latency problems.  For the purposes of
      review, it's easier to remove this sysctl now and reintroduce a similar
      sysctl later in the series that deals only with pcp->high.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbbecb35