  1. Mar 23, 2022
    • mm: remove unneeded local variable follflags · 87d2762e
      Miaohe Lin authored
      
      
      We can pass FOLL_GET | FOLL_DUMP to follow_page directly to simplify the
      code a bit in add_page_for_migration and split_huge_pages_pid.
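
      A minimal sketch of the kind of simplification described above (not the
      exact diff; the surrounding code in add_page_for_migration is assumed):

      	/* before: flags stored in a single-use local variable */
      	unsigned int follflags = FOLL_GET | FOLL_DUMP;
      	page = follow_page(vma, addr, follflags);

      	/* after: pass the flags directly and drop the local */
      	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);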
      
      Link: https://lkml.kernel.org/r/20220311072002.35575-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: export PageHeadHuge() · 4e936ecc
      David Howells authored
      
      
      Export PageHeadHuge() - it's used by folio_test_hugetlb() and hence by
      helpers such as folio_file_page() and folio_contains().  Matthew suggested I use
      the first of those instead of doing the same calculation manually - but I
      can't call it from a module.
      
      Kirill suggested rearranging things to put it in a header, but that
      introduces header dependencies because of where constants are defined.
      
      [akpm@linux-foundation.org: s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/, per Christoph]
      
      Link: https://lkml.kernel.org/r/2494562.1646054576@warthog.procyon.org.uk
      Link: https://lore.kernel.org/r/163707085314.3221130.14783857863702203440.stgit@warthog.procyon.org.uk/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: use helper macro __ATTR_RW · 98bc26ac
      Miaohe Lin authored
      
      
      Use the helper macro __ATTR_RW to define HSTATE_ATTR to make the code clearer.
      Minor readability improvement.
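
      The change is roughly of this shape (a sketch, not the exact diff;
      __ATTR_RW(_name) expands to __ATTR(_name, 0644, _name##_show, _name##_store)):

      	/* before */
      	#define HSTATE_ATTR(_name) \
      		static struct kobj_attribute _name##_attr = \
      			__ATTR(_name, 0644, _name##_show, _name##_store)

      	/* after */
      	#define HSTATE_ATTR(_name) \
      		static struct kobj_attribute _name##_attr = __ATTR_RW(_name)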
      
      Link: https://lkml.kernel.org/r/20220222112731.33479-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: clean up potential spectre issue warnings · f9317f77
      Mike Kravetz authored
      
      
      Recently introduced code allows numa nodes to be specified on the kernel
      command line for hugetlb allocations or CMA reservations.  The node
      values are user specified and used as indices into arrays.  This
      generated the following smatch warnings:
      
        mm/hugetlb.c:4170 hugepages_setup() warn: potential spectre issue 'default_hugepages_in_node' [w]
        mm/hugetlb.c:4172 hugepages_setup() warn: potential spectre issue 'parsed_hstate->max_huge_pages_node' [w]
        mm/hugetlb.c:6898 cmdline_parse_hugetlb_cma() warn: potential spectre issue 'hugetlb_cma_size_in_node' [w] (local cap)
      
      Clean up by using array_index_nospec to sanitize array indices.
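
      The pattern is roughly the following (a sketch; the exact variable names in
      hugepages_setup() are assumptions):

      	#include <linux/nospec.h>

      	/* clamp the user-supplied node to a valid index before using it,
      	 * so a mispredicted bounds check cannot speculatively index OOB */
      	node = array_index_nospec(node, MAX_NUMNODES);
      	default_hugepages_in_node[node] = tmp;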
      
      The routine cmdline_parse_hugetlb_cma has the same overflow/truncation
      issue addressed in [1].  That is also fixed with this change.
      
      [1] https://lore.kernel.org/linux-mm/20220209134018.8242-1-liuyuntao10@huawei.com/
      
      As Michal pointed out, this is unlikely to be exploitable because it is
      __init code.  But the patch suppresses the warnings.
      
      [mike.kravetz@oracle.com: v2]
        Link: https://lkml.kernel.org/r/20220218212946.35441-1-mike.kravetz@oracle.com
      
      Link: https://lkml.kernel.org/r/20220217234218.192885-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
      Cc: Liu Yuntao <liuyuntao10@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB · 07431506
      Anshuman Khandual authored
      
      
      The ARCH_WANT_GENERAL_HUGETLB config has duplicate definitions on platforms
      that subscribe to it.  Instead, make it a generic config option which can be
      selected on applicable platforms when required.
      
      Link: https://lkml.kernel.org/r/1643718465-4324-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP · e5408417
      Muchun Song authored
      
      
      The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those
      functions to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: vm: add a hugetlb test case · b147c89c
      Muchun Song authored
      
      
      Since the head vmemmap page frame associated with each HugeTLB page is
      reused, we should hide the PG_head flag of tail struct pages from the
      user.  Add a test case to check whether it works properly.  The test
      steps are as follows.
      
        1) alloc 2MB hugeTLB
        2) get each page frame
        3) apply those APIs in each page frame
        4) check that those APIs work completely the same as before.
      
      Reading the flags of a page by /proc/kpageflags is done in
      stable_page_flags(), which has invoked PageHead(), PageTail(),
      PageCompound() and compound_head().
      
      If those APIs work properly, the head page must have bits 15 and 17 set,
      and tail pages must have bits 16 and 17 set but bit 15 unset.  Those
      flags are checked in check_page_flags().
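
      A sketch of the kind of check performed (the file-descriptor handling and
      variable names are assumptions; the KPF_* bit numbers come from
      Documentation/admin-guide/mm/pagemap.rst):

      	#define KPF_COMPOUND_HEAD	15
      	#define KPF_COMPOUND_TAIL	16
      	#define KPF_HUGE		17

      	uint64_t flags;

      	/* one 64-bit flag word per pfn in /proc/kpageflags */
      	pread(kpageflags_fd, &flags, sizeof(flags), pfn * sizeof(flags));

      	if (i == 0)		/* head page of the 2MB HugeTLB page */
      		ok = (flags & (1UL << KPF_COMPOUND_HEAD)) &&
      		     (flags & (1UL << KPF_HUGE));
      	else			/* tail pages: no (real or fake) head flag exposed */
      		ok = !(flags & (1UL << KPF_COMPOUND_HEAD)) &&
      		      (flags & (1UL << KPF_COMPOUND_TAIL)) &&
      		      (flags & (1UL << KPF_HUGE));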
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: sparsemem: use page table lock to protect kernel pmd operations · d8d55f56
      Muchun Song authored
      
      
      The init_mm.page_table_lock is used to protect kernel page tables, so we
      can use it to serialize splitting vmemmap PMD mappings instead of the mmap
      write lock, which can increase the concurrency of vmemmap_remap_free().

      Actually, it increases the concurrency between allocations of HugeTLB
      pages.  But that is not the only benefit.  There are a lot of users of
      the mmap read lock of init_mm.  The mmap write lock was held throughout
      vmemmap_remap_free(); removing that usage means it no longer affects other
      users of the mmap read lock.  It does not make anything worse and is
      always a win.
      
      Now the kernel page table walker does not hold the page_table_lock when
      walking pmd entries, so there may be a consistency issue for a pmd entry:
      it might change from a huge pmd entry to a PTE page table under the
      walker.  There is only one user of the kernel page table walker, namely
      ptdump.  ptdump already accounts for this by using a local variable to
      cache the value of the pmd entry.  But we also need to update ->action to
      ACTION_CONTINUE to make sure the walker does not walk every pte entry
      again when a concurrent thread has split the huge pmd.
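
      A sketch of the walker side (simplified; the callback shape follows the
      pagewalk API, but the dumping itself is elided):

      	static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
      				    unsigned long next, struct mm_walk *walk)
      	{
      		pmd_t val = READ_ONCE(*pmd);	/* snapshot: may be split concurrently */

      		if (pmd_leaf(val)) {
      			/* dump the huge mapping from the cached value ... */
      			walk->action = ACTION_CONTINUE;	/* do not descend to ptes */
      		}

      		return 0;
      	}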
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key · a6b40850
      Muchun Song authored
      
      
      page_fixed_fake_head() is used throughout memory management, and its
      conditional check requires reading a global variable.  Although the
      overhead of this check may be small, it grows when the memory cache comes
      under pressure.  Also, the global variable will not be modified after
      system boot, so it is a very good fit for the static key mechanism.
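
      A minimal sketch of the static-key pattern (the key, parameter and helper
      names here are illustrative, not the exact ones in the patch):

      	DEFINE_STATIC_KEY_FALSE(hugetlb_free_vmemmap_key);

      	static int __init parse_hugetlb_free_vmemmap(char *s)
      	{
      		if (s && !strcmp(s, "on"))
      			static_branch_enable(&hugetlb_free_vmemmap_key);
      		return 0;
      	}
      	early_param("hugetlb_free_vmemmap", parse_hugetlb_free_vmemmap);

      	static __always_inline bool hugetlb_free_vmemmap_enabled(void)
      	{
      		/* compiles to a patched jump: no global load on the fast path */
      		return static_branch_unlikely(&hugetlb_free_vmemmap_key);
      	}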
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page · e7d32485
      Muchun Song authored
      
      
      Patch series "Free the 2nd vmemmap page associated with each HugeTLB
      page", v7.
      
      This series can minimize the overhead of struct page for 2MB HugeTLB
      pages significantly.  It further reduces the overhead of struct page by
      12.5% for a 2MB HugeTLB compared to the previous approach, which means
      2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
      welcome.  Thanks.
      
      The main implementation and details can be found in the commit log of patch
      1.  In this series, I have changed the following four helpers; the
      following table shows the impact on the overhead of those helpers.
      
      	+------------------+-----------------------+
      	|       APIs       | head page | tail page |
      	+------------------+-----------+-----------+
      	|    PageHead()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|    PageTail()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|  PageCompound()  |     N     |     N     |
      	+------------------+-----------+-----------+
      	|  compound_head() |     Y     |     N     |
      	+------------------+-----------+-----------+
      
      	Y: Overhead is increased.
      	N: Overhead is _NOT_ increased.
      
      It shows that the overhead of those helpers on a tail page doesn't change
      between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off".  But the
      overhead on a head page will be increased when "hugetlb_free_vmemmap=on"
      (except for PageCompound()).  So I believe that Matthew Wilcox's folio
      series will help with this.
      
      The users of PageHead() and PageTail() are far fewer than those of
      compound_head(), and most users of PageTail() are VM_BUG_ON()s, so I have
      done some tests on the overhead of compound_head() on head pages.

      I have tested the overhead of calling compound_head() on a head page,
      which is 2.11ns (measured by calling compound_head() 10 million times and
      averaging).
      
      For a head page whose address is not aligned with PAGE_SIZE, or for a
      non-compound page, the overhead of compound_head() is 2.54ns, an increase
      of 20%.  For a head page whose address is aligned with PAGE_SIZE, the
      overhead of compound_head() is 2.97ns, an increase of 40%.  Most pages are
      the former.  I do not think the overhead is significant since the overhead
      of compound_head() itself is low.
      
      This patch (of 5):
      
      This patch minimizes the overhead of struct page for 2MB HugeTLB pages
      significantly.  It further reduces the overhead of struct page by 12.5%
      for a 2MB HugeTLB compared to the previous approach, which means 2GB per
      1TB HugeTLB (2MB type).
      
      After the feature "Free some vmemmap pages of HugeTLB page" is
      enabled, the mapping of the vmemmap addresses associated with a 2MB
      HugeTLB page becomes the figure below.
      
           HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
      remapped.  However, the 2nd vmemmap page frame can also be freed to
      the buddy allocator, in which case we can change the mapping from the
      figure above to the figure below.
      
          HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
       |           |                     +-----------+                  | | | | | |
       |           |                     |     2     | -----------------+ | | | | |
       |           |                     +-----------+                    | | | | |
       |           |                     |     3     | -------------------+ | | | |
       |           |                     +-----------+                      | | | |
       |           |                     |     4     | ---------------------+ | | |
       |    2MB    |                     +-----------+                        | | |
       |           |                     |     5     | -----------------------+ | |
       |           |                     +-----------+                          | |
       |           |                     |     6     | -------------------------+ |
       |           |                     +-----------+                            |
       |           |                     |     7     | ---------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      After we do this, all tail vmemmap pages (1-7) are mapped to the head
      vmemmap page frame (0).  In other words, there is more than one page
      struct with PG_head associated with each HugeTLB page.  We __know__ that
      there is only one real head page struct; the tail page structs with PG_head
      are fake head page structs.  We need an approach to distinguish between
      those two different types of page structs so that compound_head(),
      PageHead() and PageTail() can work properly if the parameter is a tail
      page struct that has PG_head set.
      
      The following code snippet describes how to distinguish between real and
      fake head page struct.
      
      	if (test_bit(PG_head, &page->flags)) {
      		unsigned long head = READ_ONCE(page[1].compound_head);
      
      		if (head & 1) {
      			if (head == (unsigned long)page + 1)
      				==> head page struct
      			else
      				==> tail page struct
      		} else
      			==> head page struct
      	}
      
      We can safely access the fields of @page[1] when @page has PG_head set,
      because @page is a compound page composed of at least two contiguous pages.
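
      Folded into a helper, the logic above looks roughly like this (a sketch;
      the helper name and the exact guards in the real implementation are
      assumptions):

      	static __always_inline const struct page *
      	page_fixed_fake_head(const struct page *page)
      	{
      		if (test_bit(PG_head, &page->flags)) {
      			unsigned long head = READ_ONCE(page[1].compound_head);

      			/* bit 0 set: page[1] records its head page; head - 1 is
      			 * the real head (== page itself if page is the real head) */
      			if (head & 1)
      				return (const struct page *)(head - 1);
      		}
      		return page;
      	}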
      
      [songmuchun@bytedance.com: restore lost comment changes]
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mlock: fix potential imbalanced rlimit ucounts adjustment · 5c2a956c
      Miaohe Lin authored
      user_shm_lock forgets to set allowed to 0 when get_ucounts fails, so a
      later user_shm_unlock might do an extra dec_rlimit_ucounts.  Fix
      this by resetting allowed to 0.
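
      A sketch of the fixed error path (simplified; the exact ucount counter
      name and label are assumptions):

      	if (!get_ucounts(ucounts)) {
      		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
      		allowed = 0;	/* the fix: report failure, so a later
      				 * user_shm_unlock() won't drop the count again */
      		goto out;
      	}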
      
      Link: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com
      Fixes: d7c9e99a ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fault-injection: declare should_fail_alloc_page() · 1e7a8181
      Vlastimil Babka authored
      The mm/ directory can almost fully be built with W=1, which would help
      in local development.  One remaining issue is the missing prototype for
      should_fail_alloc_page().  Thus add it next to the should_failslab()
      prototype.
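
      The added declaration is essentially the following (a sketch; the exact
      header placement next to should_failslab() is as described above):

      	bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);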
      
      Note the previous attempt by commit f7173090 ("mm/page_alloc: make
      should_fail_alloc_page() static") had to be reverted by commit 54aa3866
      as it caused an unresolved symbol error with CONFIG_DEBUG_INFO_BTF=y.
      
      Link: https://lkml.kernel.org/r/20220314165724.16071-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: make non-LRU movable pages unhandlable · bf6445bc
      Miaohe Lin authored
      
      
      We cannot really handle non-LRU movable pages in memory failure.
      Typically they are balloon pages, zsmalloc pages, etc.

      Assuming we run into a base (4K) non-LRU movable page, we could reach as
      far as identify_page_state(); it should not fall into any category
      except me_unknown.

      Non-LRU compound movable pages could be mistaken for transhuge pages, but
      it's unexpected to split non-LRU movable pages using
      split_huge_page_to_list in memory_failure.  So we can simply make
      non-LRU movable pages unhandlable to avoid these possible nasty cases.
      
      Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Suggested-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages · 593396b8
      Miaohe Lin authored
      
      
      Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
      into its one caller"), invalidate_inode_page() can invalidate pages
      in the swap cache because the check of page->mapping != mapping has been
      removed.  But invalidate_inode_page() is not expected to deal with
      pages in the swap cache.  Also, non-LRU movable pages can reach here too;
      they're not page cache pages.  Skip these pages by checking
      PageSwapCache and PageLRU.
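
      A sketch of the added guard (the exact call site in memory-failure.c and
      the surrounding return convention are assumptions):

      	/* only clean page cache pages are safe to invalidate here */
      	if (!PageLRU(p) || PageSwapCache(p))
      		return false;	/* non-LRU movable or swap cache: leave it alone */

      	return invalidate_inode_page(p) != 0;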
      
      Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: fix race with changing page compound again · 888af270
      Miaohe Lin authored
      
      
      Patch series "A few fixup patches for memory failure", v2.
      
      This series contains a few patches to fix the race with a changing
      compound page, make non-LRU movable pages unhandlable, and so on.  More
      details can be found in the respective changelogs.

      There is a race window where, after we get the compound_head, the hugetlb
      page could be freed to the buddy allocator, or even changed to another
      compound page, just before we try to get the hwpoison page.  Think about
      the below race window:
      
        CPU 1					  CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
      					  hugetlb page might be freed to
      					  buddy, or even changed to another
      					  compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      If this race happens, just bail out.  Also MF_MSG_DIFFERENT_PAGE_SIZE is
      introduced to record this event.
      
      [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]
      
      Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: add in-use hugepage hwpoison filter judgement · a06ad3c0
      luofei authored
      
      
      After successfully obtaining the reference count of the huge page, it is
      still necessary to call hwpoison_filter() to make a filter judgement;
      otherwise the filtered hugepage will be unmapped and the related process
      may be killed.
      
      Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com
      Signed-off-by: luofei <luofei@unicloud.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler · d1fe111f
      luofei authored
      
      
      When the hwpoison page meets the filter conditions, it should not be
      regarded as successful memory_failure() processing by the MCE handler;
      a distinct value should be returned instead.  Otherwise the MCE handler
      assumes the error page has been identified and isolated, which may lead
      to calling set_mce_nospec() to change page attributes, etc.

      Here memory_failure() returns -EOPNOTSUPP to indicate that the error
      event was filtered; the MCE handler should not take any action for this
      situation, and the hwpoison injector should treat it as handled correctly.
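
      A sketch of the shape of the change in memory_failure() (the label and
      the refcount handling are assumptions):

      	if (hwpoison_filter(p)) {
      		if (flags & MF_COUNT_INCREASED)
      			put_page(p);
      		res = -EOPNOTSUPP;	/* filtered: neither a failure nor a recovery */
      		goto unlock_mutex;
      	}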
      
      Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
      Signed-off-by: luofei <luofei@unicloud.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison-inject: support injecting hwpoison to free page · a581865e
      Miaohe Lin authored
      
      
      memory_failure() can handle free buddy pages.  Support injecting hwpoison
      into a free page by adding an is_free_buddy_page check when the hwpoison
      filter is disabled.
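
      A sketch of the idea in hwpoison_inject() (the surrounding checks and the
      label are assumptions):

      	/* free buddy pages have no mapping to filter on, so only allow
      	 * injecting into them when filtering is disabled */
      	if (!hwpoison_filter_enable && is_free_buddy_page(p))
      		goto inject;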
      
      [akpm@linux-foundation.org: export is_free_buddy_page() to modules]
      
      Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: remove unnecessary PageTransTail check · b04d3eeb
      Miaohe Lin authored
      
      
      When we reach here, we're guaranteed to have a non-compound page, as the
      thp has already been split.  Remove this unnecessary PageTransTail check.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-9-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: remove obsolete comment in __soft_offline_page · 2ab91679
      Miaohe Lin authored
      Since commit add05cec ("mm: soft-offline: don't free target page in
      successful page migration"), the set_migratetype_isolate logic has been
      removed.  Remove this obsolete comment.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-8-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings() · 357670f7
      Miaohe Lin authored
      
      
      Only for hugetlb pages in shared mappings does try_to_unmap need to take
      the semaphore in write mode here.  Rework the code to make this clear.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-7-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev · 67ff51c6
      Miaohe Lin authored
      Since commit 03e5ac2f ("mm: fix crash when using XFS on loopback"),
      page_mapping() can handle Slab pages.  So remove this unnecessary
      PageSlab check and the obsolete comment.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: fix race with changing page more robustly · 75ee64b3
      Miaohe Lin authored
      
      
      We only intend to deal with non-compound pages after we split the thp
      in memory_failure.  However, the page could have become a compound page
      again due to a race window.  If this happens, retry once to hopefully
      handle the page in the next round.  Also remove the unneeded orig_head;
      it's always equal to hpage, so we can use hpage directly and drop this
      redundant variable.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: rework the signaling logic in kill_proc · 49775047
      Miaohe Lin authored
      
      
      The BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and the
      target is the current task.  Rework the code to make this clear.
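
      A sketch of the reworked logic (the to_kill fields and error handling are
      simplified):

      	if ((flags & MF_ACTION_REQUIRED) && (t == current))
      		/* synchronous, action-required fault on the current task */
      		ret = force_sig_mceerr(BUS_MCEERR_AR,
      				       (void __user *)tk->addr, addr_lsb);
      	else
      		/* action-optional notification for some other task */
      		ret = send_sig_mceerr(BUS_MCEERR_AO,
      				      (void __user *)tk->addr, addr_lsb, t);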
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: catch unexpected -EFAULT from vma_address() · a994402b
      Miaohe Lin authored
      
      
      It's unexpected to walk the page table when vma_address() returns
      -EFAULT.  But dev_pagemap_mapping_shift() is called only after the vma
      associated with the error page has been found in
      collect_procs_{file,anon}, so vma_address() should not return -EFAULT
      except in the presence of some bug, as Naoya pointed out.  We can use
      VM_BUG_ON_VMA() to catch this bug here.
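
      A sketch of the check in dev_pagemap_mapping_shift() (simplified):

      	unsigned long address = vma_address(page, vma);

      	/* the vma was found via collect_procs_*, so this must not fail */
      	VM_BUG_ON_VMA(address == -EFAULT, vma);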
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap · 577553f4
      Miaohe Lin authored
      
      
      Patch series "A few cleanup and fixup patches for memory failure", v3.
      
      This series contains a few patches to simplify the code logic, remove an
      unneeded variable and remove an obsolete comment.  We also fix the race
      with a changing page more robustly in memory_failure.  More details can be
      found in the respective changelogs.
      
      This patch (of 8):
      
      The flags always have MF_ACTION_REQUIRED and MF_MUST_KILL set, so we do
      not need to check these flags again.
      
      Link: https://lkml.kernel.org/r/20220218090118.1105-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220218090118.1105-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: invalidate hwpoison page cache page in fault path · e53ac737
      Rik van Riel authored
      
      
      Sometimes the page offlining code can leave behind a hwpoisoned clean
      page cache page.  This can lead to programs being killed over and over
      and over again as they fault in the hwpoisoned page, get killed, and
      then get re-spawned by whatever wanted to run them.
      
      This is particularly embarrassing when the page was offlined due to
      having too many corrected memory errors.  Now we are killing tasks due
      to them trying to access memory that probably isn't even corrupted.
      
      This problem can be avoided by invalidating the page from the page fault
      handler, which already has a branch for dealing with these kinds of
      pages.  With this patch we simply pretend the page fault was successful
      if the page was invalidated, return to userspace, incur another page
      fault, read in the file from disk (to a new memory page), and then
      everything works again.
      
      Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix error page recovered but reported "not recovered" · 046545a6
      Naoya Horiguchi authored
      When an uncorrected memory error is consumed, there is a race between the
      CMCI from the memory controller reporting an uncorrected error with a
      UCNA signature and the core reporting an SRAR signature machine check
      when the data is about to be consumed.
      
      If the CMCI wins that race, the page is marked poisoned when
      uc_decode_notifier() calls memory_failure() and the machine check
      processing code finds the page already poisoned.  It calls
      kill_accessing_process() to make sure a SIGBUS is sent.  But it returns
      the wrong error code.
      
      Console log looks like this:
      
        mce: Uncorrected hardware memory error in user-access at 3710b3400
        Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
        Memory failure: 0x3710b3: already hardware poisoned
        Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
        mce: Memory error not recovered
      
      kill_accessing_process() is supposed to return -EHWPOISON to notify that
      a SIGBUS has already been sent to the process and kill_me_maybe() doesn't
      have to send it again.  But the current code simply fails to do this, so
      fix it to work as intended.  This change avoids the noise message
      "Memory error not recovered" and skips duplicate SIGBUSs.
      
      [tony.luck@intel.com: reword some parts of commit message]
      
      Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
      Fixes: a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Youquan Song <youquan.song@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: remove obsolete comment · ae483c20
      Naoya Horiguchi authored
      
      
      With the introduction of mf_mutex, most of the memory error handling
      process is mutually exclusive, so the in-line comment about the subtlety
      of double-checking PageHWPoison is no longer correct.  So remove it.
      
      Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: check high-order pages for corruption during PCP operations · 77fe7f13
      Mel Gorman authored
      Eric Dumazet pointed out that commit 44042b44 ("mm/page_alloc: allow
      high-order pages to be stored on the per-cpu lists") only checks the
      head page during PCP refill and allocation operations.  This was an
      oversight and all pages should be checked.  This will incur a small
      performance penalty but it's necessary for correctness.
      
      Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
      Fixes: 44042b44 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: call check_new_pages() while zone spinlock is not held · 3313204c
      Eric Dumazet authored
      
      
      For high-order pages not using the pcp, rmqueue() currently calls the
      costly check_new_pages() while the zone spinlock is held and hard irqs are
      masked.

      This is not needed; we can release the spinlock sooner to reduce zone
      spinlock contention.
      
      Note that after this patch, we call __mod_zone_freepage_state() before
      deciding to leak the page because it is in bad state.
      
      Link: https://lkml.kernel.org/r/20220304170215.1868106-1-eric.dumazet@gmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: count time in drain_all_pages during direct reclaim as memory pressure · fa7fc75f
      Suren Baghdasaryan authored
      
      
      When page allocation in direct reclaim path fails, the system will make
      one attempt to shrink per-cpu page lists and free pages from high alloc
      reserves.  Draining per-cpu pages into buddy allocator can be a very
      slow operation because it's done using workqueues and the task in direct
      reclaim waits for all of them to finish before proceeding.  Currently
      this time is not accounted as psi memory stall.
      
      While testing mobile devices under extreme memory pressure, when
      allocations were failing during direct reclaim, we noticed that psi
      events which would be expected in such conditions were not triggered.
      After profiling these cases it was determined that the reason for
      missing psi events was that a big chunk of time spent in direct reclaim
      is not accounted as memory stall, therefore psi would not reach the
      levels at which an event is generated.  Further investigation revealed
      that the bulk of that unaccounted time was spent inside drain_all_pages
      call.
      
      A typical captured case when drain_all_pages path gets activated:
      
      __alloc_pages_slowpath  took 44.644.613ns
          __perform_reclaim   took    751.668ns (1.7%)
          drain_all_pages     took 43.887.167ns (98.3%)
      
      PSI in this case records the time spent in __perform_reclaim but ignores
      drain_all_pages, IOW it misses 98.3% of the time spent in
      __alloc_pages_slowpath.
      
      Annotate __alloc_pages_direct_reclaim in its entirety so that delays
      from handling page allocation failure in the direct reclaim path are
      accounted as memory stall.
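
      A sketch of the annotation (simplified from __alloc_pages_direct_reclaim;
      the retry details are elided):

      	unsigned long pflags;
      	struct page *page = NULL;

      	psi_memstall_enter(&pflags);

      	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
      	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
      	if (!page && !drained) {
      		/* slow: waits for per-cpu drain workers on every CPU */
      		drain_all_pages(NULL);
      		drained = true;
      		/* ...retry get_page_from_freelist() once more... */
      	}

      	psi_memstall_leave(&pflags);	/* the drain is now counted as memstall */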
      
      Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: Tim Murray <timmurray@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/x86/mm/numa: Do not initialize nodes twice · 1ca75fa7
      Oscar Salvador authored
      
      
      On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA
      nodes could be allocated at three different places.
      
       - numa_register_memblks
       - init_cpu_to_node
       - init_gi_nodes
      
      All these calls happen at setup_arch, and have the following order:
      
      setup_arch
        ...
        x86_numa_init
         numa_init
          numa_register_memblks
        ...
        init_cpu_to_node
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
        init_gi_nodes
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
      
      numa_register_memblks() is only interested in those nodes which have
      memory, so it skips over any memoryless node it finds.  Later on, when
      we have read ACPI's SRAT table, we call init_cpu_to_node() and
      init_gi_nodes(), which initialize any memoryless nodes we might have that
      have either CPU or Initiator affinity, meaning we allocate a pg_data_t
      struct for them and mark them as ONLINE.
      
      So far so good, but the thing is that after ("mm: handle uninitialized
      numa nodes gracefully"), we allocate all possible NUMA nodes in
      free_area_init(), meaning we have a picture like the following:
      
      setup_arch
        x86_numa_init
         numa_init
          numa_register_memblks  <-- allocate non-memoryless node
        x86_init.paging.pagetable_init
         ...
          free_area_init
           free_area_init_memoryless <-- allocate memoryless node
        init_cpu_to_node
         alloc_node_data             <-- allocate memoryless node with CPU
         free_area_init_memoryless_node
        init_gi_nodes
         alloc_node_data             <-- allocate memoryless node with Initiator
         free_area_init_memoryless_node
      
      free_area_init() already allocates all possible NUMA nodes, but
      init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
      go ahead and allocate a new pg_data_t struct without checking anything,
      meaning we end up allocating twice.
      
      It should be made clear that this only happens in the case where a
      memoryless NUMA node happens to have CPU/Initiator affinity.
      
      So get rid of init_memory_less_node() and just set the node online.
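
      A sketch of the resulting shape of init_cpu_to_node()/init_gi_nodes()
      (simplified; the exact helper structure is an assumption):

      	/* pg_data_t was already allocated for every possible node by
      	 * free_area_init(); just make sure the node is marked online so
      	 * __try_online_node() backs off later during CPU bringup */
      	if (!node_online(nid))
      		node_set_online(nid);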
      
      Note that setting the node online is needed, otherwise we choke down the
      chain when bringup_nonboot_cpus() ends up calling
      __try_online_node()->register_one_node()->...  and we blow up in
      bus_add_device().  As can be seen here:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000060
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
        RIP: 0010:bus_add_device+0x5a/0x140
        Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
        RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
        RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
        RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
        R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
        R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
        Call Trace:
         device_add+0x4c0/0x910
         __register_one_node+0x97/0x2d0
         __try_online_node+0x85/0xc0
         try_online_node+0x25/0x40
         cpu_up+0x4f/0x100
         bringup_nonboot_cpus+0x4f/0x60
         smp_init+0x26/0x79
         kernel_init_freeable+0x130/0x2f1
         kernel_init+0x17/0x150
         ret_from_fork+0x22/0x30
      
      The reason is simple, by the time bringup_nonboot_cpus() gets called, we
      did not register the node_subsys bus yet, so we crash when
      bus_add_device() tries to dereference bus()->p.
      
      The following shows the order of the calls:
      
      kernel_init_freeable
       smp_init
        bringup_nonboot_cpus
         ...
           bus_add_device()      <- we did not register node_subsys yet
       do_basic_setup
        do_initcalls
         postcore_initcall(register_node_type);
          register_node_type
           subsys_system_register
            subsys_register
             bus_register         <- register node_subsys bus
      
      Why does setting the node online save us, then?  Simply because
      __try_online_node() backs off when the node is online, meaning we do not
      end up calling register_one_node() in the first place.
      
      This is subtle, broken and deserves a deep analysis and thought about
      how to put this into shape, but for now let us have this easy fix for
      the leaking memory issue.
      
      [osalvador@suse.de: add comments]
        Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
      Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: do not prefetch buddies during bulk free · 2a791f44
      Mel Gorman authored
      free_pcppages_bulk() has taken two passes through the pcp lists since
      commit 0a5f4e5b ("mm/free_pcppages_bulk: do not hold lock when
      picking pages to free") due to deferring the cost of selecting PCP lists
      until the zone lock is held.
      
      As the list processing now takes place under the zone lock, it's less
      clear that this will always be a benefit, for two reasons.
      
      1. There is a guaranteed cost to calculating the buddy which definitely
         has to be calculated again. However, as the zone lock is held and
         there is no deferring of buddy merging, there is no guarantee that the
         prefetch will have completed when the second buddy calculation takes
         place and buddies are being merged.  With or without the prefetch, there
         may be further stalls depending on how many pages get merged. In other
         words, a stall due to merging is inevitable and at best only one stall
         might be avoided at the cost of calculating the buddy location twice.
      
      2. As the zone lock is held, prefetch_nr makes less sense as once
         prefetch_nr expires, the cache lines of interest have already been
         merged.
      
      The main concern is that there is a definite cost to calculating the
      buddy location early for the prefetch, and it is a "maybe win" depending
      on whether the CPU prefetch logic and memory are fast enough.  Remove the
      prefetch logic on the basis that reduced instructions in a path is
      always a saving, whereas the prefetch might save one memory stall
      depending on the CPU and memory.
      
      In most cases, this has marginal benefit as the calculations are a small
      part of the overall freeing of pages.  However, it was detectable on at
      least one machine.
      
                                    5.17.0-rc3             5.17.0-rc3
                          mm-highpcplimit-v2r1     mm-noprefetch-v1r1
      Min       elapsed      630.00 (   0.00%)      610.00 (   3.17%)
      Amean     elapsed      639.00 (   0.00%)      623.00 *   2.50%*
      Max       elapsed      660.00 (   0.00%)      660.00 (   0.00%)
      
      Link: https://lkml.kernel.org/r/20220221094119.15282-2-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Suggested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a791f44
    • Mel Gorman's avatar
      mm/page_alloc: limit number of high-order pages on PCP during bulk free · f26b3fa0
      Mel Gorman authored
      
      
      When a PCP is mostly used for frees then high-order pages can exist on
      PCP lists for some time.  This is problematic when the allocation
      pattern is all allocations from one CPU and all frees from another
      resulting in colder pages being used.  When bulk freeing pages, limit
      the number of high-order pages that are stored on the PCP lists.
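
      A sketch of one plausible form of the trigger; the helper name
      pcp_free_high() and the exact condition are illustrative, not the
      mainline hunk:

              /*
               * Illustrative helper: when the PCP is seeing mostly frees and
               * the page being freed has a non-zero order no larger than
               * PAGE_ALLOC_COSTLY_ORDER, treat the high watermark as
               * exceeded so bulk free returns such pages to the buddy lists
               * promptly instead of letting them go cache cold on the PCP.
               */
              static bool pcp_free_high(struct per_cpu_pages *pcp, unsigned int order)
              {
                      return pcp->free_factor && order &&
                             order <= PAGE_ALLOC_COSTLY_ORDER;
              }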
      
      Netperf running on localhost exhibits this pattern and while it does not
      matter for some machines, it does matter for others with smaller caches
      where cache misses cause problems due to reduced page reuse.  Pages
      freed directly to the buddy list may be reused quickly while still
      cache hot, whereas pages stored on the PCP lists may be cache cold by
      the time free_pcppages_bulk() is called.
      
      Using perf kmem:mm_page_alloc, the 5 most used page frames were
      
      5.17-rc3
        13041 pfn=0x111a30
        13081 pfn=0x5814d0
        13097 pfn=0x108258
        13121 pfn=0x689598
        13128 pfn=0x5814d8
      
      5.17-revert-highpcp
       192009 pfn=0x54c140
       195426 pfn=0x1081d0
       200908 pfn=0x61c808
       243515 pfn=0xa9dc20
       402523 pfn=0x222bb8
      
      5.17-full-series
       142693 pfn=0x346208
       162227 pfn=0x13bf08
       166413 pfn=0x2711e0
       166950 pfn=0x2702f8
      
      The spread is wider because there is still a delay before pages freed
      to one PCP get released, reflecting the tradeoff between fast reuse and
      reduced zone lock acquisition.
      
      On the machine used to gather the traces, the headline performance was
      equivalent.
      
      netperf-tcp
                                  5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                     vanilla  mm-reverthighpcp-v1r1     mm-highpcplimit-v2
      Hmean     64         839.93 (   0.00%)      840.77 (   0.10%)      841.02 (   0.13%)
      Hmean     128       1614.22 (   0.00%)     1622.07 *   0.49%*     1636.41 *   1.37%*
      Hmean     256       2952.00 (   0.00%)     2953.19 (   0.04%)     2977.76 *   0.87%*
      Hmean     1024     10291.67 (   0.00%)    10239.17 (  -0.51%)    10434.41 *   1.39%*
      Hmean     2048     17335.08 (   0.00%)    17399.97 (   0.37%)    17134.81 *  -1.16%*
      Hmean     3312     22628.15 (   0.00%)    22471.97 (  -0.69%)    22422.78 (  -0.91%)
      Hmean     4096     25009.50 (   0.00%)    24752.83 *  -1.03%*    24740.41 (  -1.08%)
      Hmean     8192     32745.01 (   0.00%)    31682.63 *  -3.24%*    32153.50 *  -1.81%*
      Hmean     16384    39759.59 (   0.00%)    36805.78 *  -7.43%*    38948.13 *  -2.04%*
      
      On a 1-socket Skylake machine with a small CPU cache, which suffers
      more when cache misses are too high:
      
      netperf-tcp
                                  5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                     vanilla    mm-reverthighpcp-v1     mm-highpcplimit-v2
      Hmean     64         938.95 (   0.00%)      941.50 *   0.27%*      943.61 *   0.50%*
      Hmean     128       1843.10 (   0.00%)     1857.58 *   0.79%*     1861.09 *   0.98%*
      Hmean     256       3573.07 (   0.00%)     3667.45 *   2.64%*     3674.91 *   2.85%*
      Hmean     1024     13206.52 (   0.00%)    13487.80 *   2.13%*    13393.21 *   1.41%*
      Hmean     2048     22870.23 (   0.00%)    23337.96 *   2.05%*    23188.41 *   1.39%*
      Hmean     3312     31001.99 (   0.00%)    32206.50 *   3.89%*    31863.62 *   2.78%*
      Hmean     4096     35364.59 (   0.00%)    36490.96 *   3.19%*    36112.54 *   2.11%*
      Hmean     8192     48497.71 (   0.00%)    49954.05 *   3.00%*    49588.26 *   2.25%*
      Hmean     16384    58410.86 (   0.00%)    60839.80 *   4.16%*    62282.96 *   6.63%*
      
      Note that this was a machine that did not benefit from caching high-order
      pages and performance is almost restored with the series applied.  It's
      not fully restored as cache misses are still higher.  This is a
      trade-off between optimising for a workload that does all allocations
      on one CPU and all frees on another, and more general workloads that
      need high-order pages for SLUB and benefit from avoiding zone->lock
      for every SLUB refill/drain.
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-7-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f26b3fa0
    • Mel Gorman's avatar
      mm/page_alloc: free pages in a single pass during bulk free · 8b10b465
      Mel Gorman authored
      free_pcppages_bulk() has taken two passes through the pcp lists since
      commit 0a5f4e5b ("mm/free_pcppages_bulk: do not hold lock when picking
      pages to free") due to deferring the cost of selecting PCP lists until
      the zone lock is held.  Now that list selection is simpler, the main
      cost during selection is bulkfree_pcp_prepare(), which in the normal
      case is a simple check and prefetching.  As the list manipulations have
      a cost of their own, go back to freeing pages in a single pass.
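
      A sketch of the single-pass shape (heavily simplified; isolation
      handling and the logic that advances to the next non-empty list are
      elided):

              spin_lock(&zone->lock);
              while (count > 0) {
                      struct list_head *list = &pcp->lists[pindex];

                      /* free straight to the buddy allocator under zone->lock
                       * instead of collecting pages on a temporary list and
                       * walking that list a second time */
                      do {
                              struct page *page = list_last_entry(list, struct page, lru);
                              int mt = get_pcppage_migratetype(page);

                              list_del(&page->lru);
                              count -= 1 << order;
                              __free_one_page(page, page_to_pfn(page), zone,
                                              order, mt, FPI_NONE);
                      } while (count > 0 && !list_empty(list));

                      /* ... pick the next non-empty pindex/order ... */
              }
              spin_unlock(&zone->lock);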
      
      The series up to this point was evaluated using a trunc microbenchmark
      that truncates sparse files stored in page cache (mmtests config
      config-io-trunc).  Sparse files were used to limit filesystem
      interaction.  The results versus a revert of storing high-order pages
      in the PCP lists are:
      
      1-socket Skylake
                                     5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                        vanilla      mm-reverthighpcp-v1     mm-highpcpopt-v2
       Min       elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
       Amean     elapsed      543.00 (   0.00%)      530.00 *   2.39%*      530.00 *   2.39%*
       Stddev    elapsed        4.83 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
       CoeffVar  elapsed        0.89 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
       Max       elapsed      550.00 (   0.00%)      530.00 (   3.64%)      530.00 (   3.64%)
       BAmean-50 elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
       BAmean-95 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)
       BAmean-99 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)
      
      2-socket CascadeLake
                                     5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                        vanilla    mm-reverthighpcp-v1       mm-highpcpopt-v2
       Min       elapsed      510.00 (   0.00%)      500.00 (   1.96%)      500.00 (   1.96%)
       Amean     elapsed      529.00 (   0.00%)      521.00 (   1.51%)      510.00 *   3.59%*
       Stddev    elapsed       16.63 (   0.00%)       12.87 (  22.64%)       11.55 (  30.58%)
       CoeffVar  elapsed        3.14 (   0.00%)        2.47 (  21.46%)        2.26 (  27.99%)
       Max       elapsed      550.00 (   0.00%)      540.00 (   1.82%)      530.00 (   3.64%)
       BAmean-50 elapsed      516.00 (   0.00%)      512.00 (   0.78%)      500.00 (   3.10%)
       BAmean-95 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)
       BAmean-99 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)
      
      The original motivation for the multiple passes was will-it-scale
      page_fault1 using $nr_cpu processes.
      
      2-socket CascadeLake (40 cores, 80 CPUs HT enabled)
                                                           5.17.0-rc3                 5.17.0-rc3
                                                              vanilla           mm-highpcpopt-v2
       Hmean     page_fault1-processes-2        2694662.26 (   0.00%)      2695780.35 (   0.04%)
       Hmean     page_fault1-processes-5        6425819.34 (   0.00%)      6435544.57 *   0.15%*
       Hmean     page_fault1-processes-8        9642169.10 (   0.00%)      9658962.39 (   0.17%)
       Hmean     page_fault1-processes-12      12167502.10 (   0.00%)     12190163.79 (   0.19%)
       Hmean     page_fault1-processes-21      15636859.03 (   0.00%)     15612447.26 (  -0.16%)
       Hmean     page_fault1-processes-30      25157348.61 (   0.00%)     25169456.65 (   0.05%)
       Hmean     page_fault1-processes-48      27694013.85 (   0.00%)     27671111.46 (  -0.08%)
       Hmean     page_fault1-processes-79      25928742.64 (   0.00%)     25934202.02 (   0.02%) <--
       Hmean     page_fault1-processes-110     25730869.75 (   0.00%)     25671880.65 *  -0.23%*
       Hmean     page_fault1-processes-141     25626992.42 (   0.00%)     25629551.61 (   0.01%)
       Hmean     page_fault1-processes-172     25611651.35 (   0.00%)     25614927.99 (   0.01%)
       Hmean     page_fault1-processes-203     25577298.75 (   0.00%)     25583445.59 (   0.02%)
       Hmean     page_fault1-processes-234     25580686.07 (   0.00%)     25608240.71 (   0.11%)
       Hmean     page_fault1-processes-265     25570215.47 (   0.00%)     25568647.58 (  -0.01%)
       Hmean     page_fault1-processes-296     25549488.62 (   0.00%)     25543935.00 (  -0.02%)
       Hmean     page_fault1-processes-320     25555149.05 (   0.00%)     25575696.74 (   0.08%)
      
      The differences are mostly within the noise and the difference close to
      $nr_cpus is negligible.
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-6-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b10b465
    • Mel Gorman's avatar
      mm/page_alloc: drain the requested list first during bulk free · d61372bc
      Mel Gorman authored
      
      
      Prior to the series, pindex 0 (order-0 MIGRATE_UNMOVABLE) was always
      skipped first and the precise reason has been forgotten.  One potential
      reason is that it artificially preserved MIGRATE_UNMOVABLE, but there
      is no reason to believe that is optimal as it depends on the workload.
      The more likely reason is that a pre-increment was less complicated
      than a post-increment in terms of overall code flow.  As
      free_pcppages_bulk() now typically receives the pindex of the PCP list
      that exceeded high, always start draining that list, as sketched below.
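
      A conceptual sketch of the ordering change (not the actual hunk): the
      selection logic now tests the passed-in list before advancing, rather
      than advancing first and thereby draining the requested list last.

              /* start with the list whose watermark was exceeded ... */
              list = &pcp->lists[pindex];
              while (list_empty(list)) {
                      /* ... and only then round-robin to the other lists */
                      pindex = (pindex + 1) % NR_PCP_LISTS;
                      list = &pcp->lists[pindex];
              }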
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-5-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d61372bc
    • Mel Gorman's avatar
      mm/page_alloc: simplify how many pages are selected per pcp list during bulk free · fd56eef2
      Mel Gorman authored
      
      
      free_pcppages_bulk() selects pages to free by round-robining between
      lists.  Originally this was to shrink pages evenly by migratetype, but
      uneven freeing is inevitable due to high pages.  Simplify list
      selection by having free_unref_page_commit() start with a list that
      definitely has pages on it; for drains it does not matter where
      draining starts, as all pages are removed.
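
      A sketch of the caller side described above (simplified; the exact
      high/batch handling is elided):

              /* free_unref_page_commit(): the list just appended to is known
               * to be non-empty, so its pindex is handed to the bulk free. */
              pindex = order_to_pindex(migratetype, order);
              list_add(&page->lru, &pcp->lists[pindex]);
              pcp->count += 1 << order;

              if (pcp->count >= READ_ONCE(pcp->high))
                      free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp, pindex);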
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-4-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fd56eef2
    • Mel Gorman's avatar
      mm/page_alloc: track range of active PCP lists during bulk free · 35b6d770
      Mel Gorman authored
      
      
      free_pcppages_bulk() frees pages in a round-robin fashion.  Originally,
      this was dealing only with migratetypes but storing high-order pages
      means that there can be many more empty lists that are uselessly
      checked.  Track the minimum and maximum active pindex to reduce the
      search space.
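
      A sketch of the bounded search (simplified from the round-robin
      selection loop; min_pindex/max_pindex start at the full range and
      shrink as empty lists are discovered):

              do {
                      if (++pindex > max_pindex)
                              pindex = min_pindex;
                      list = &pcp->lists[pindex];
                      if (!list_empty(list))
                              break;

                      /* shrink the window so empty edge lists are not revisited */
                      if (pindex == max_pindex)
                              max_pindex--;
                      if (pindex == min_pindex)
                              min_pindex++;
              } while (1);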
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-3-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35b6d770
    • Mel Gorman's avatar
      mm/page_alloc: fetch the correct pcp buddy during bulk free · ca7b59b1
      Mel Gorman authored
      Patch series "Follow-up on high-order PCP caching", v2.
      
      Commit 44042b44 ("mm/page_alloc: allow high-order pages to be stored
      on the per-cpu lists") was primarily aimed at reducing the cost of SLUB
      cache refills of high-order pages in two ways.  Firstly, zone lock
      acquisitions were reduced and secondly, there were fewer buddy list
      modifications.  This is a follow-up series fixing some issues that
      became apparent after merging.
      
      Patch 1 is a functional fix; the bug it addresses is harmless but
      inefficient.
      
      Patches 2-5 reduce the overhead of bulk freeing of PCP pages.  While
      the overhead is small, it's cumulative and noticeable when truncating
      large files.  The changelog for patch 4 includes results of a
      microbenchmark that deletes large sparse files with data in page cache.
      Sparse files were used to eliminate filesystem overhead.
      
      Patch 6 addresses issues with high-order PCP pages being stored on PCP
      lists for too long.  Pages freed on one CPU may not be reused quickly,
      and in some cases this can increase cache miss rates.  Details are
      included in the changelog.
      
      This patch (of 6):
      
      free_pcppages_bulk() prefetches buddies about to be freed, but the
      order must also be passed in, as PCP lists store multiple orders.
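
      A toy, standalone illustration of why the order matters: the kernel's
      __find_buddy_pfn() is essentially pfn ^ (1 << order), so computing the
      buddy with an assumed order 0 prefetches the wrong struct page for any
      high-order PCP entry.

              #include <stdio.h>

              /* userspace stand-in for the kernel's buddy pfn calculation */
              static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
              {
                      return pfn ^ (1UL << order);
              }

              int main(void)
              {
                      unsigned long pfn = 0x1000;

                      for (unsigned int order = 0; order <= 3; order++)
                              printf("pfn 0x%lx order %u -> buddy 0x%lx\n",
                                     pfn, order, find_buddy_pfn(pfn, order));
                      return 0;
              }

      Running it shows the buddy moving from pfn + 1 at order 0 to pfn + 8 at
      order 3, which is exactly the distance an order-agnostic prefetch gets
      wrong for high-order entries.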
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20220217002227.5739-2-mgorman@techsingularity.net
      Fixes: 44042b44 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAaron Lu <aaron.lu@intel.com>
      Tested-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca7b59b1