  1. Jun 30, 2021
    • mm,hwpoison: send SIGBUS with error virtual address · a3f5d80e
      Naoya Horiguchi authored
      
      
      Currently, an action-required MCE on an already hwpoisoned address
      sends a SIGBUS to the current process, but the SIGBUS doesn't convey
      the error virtual address.  That's not optimal for hwpoison-aware
      applications.

      To fix the issue, make memory_failure() call kill_accessing_process(),
      which does a page table walk to find the error virtual address.  It
      could find multiple virtual addresses for the same error page, and it
      is hard to tell which virtual address is the correct one.  But that's
      rare, and sending an incorrect virtual address could still be better
      than sending none.  So let's report the first virtual address found
      for now.
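
      As an illustration only (not part of the patch), a minimal userspace
      sketch of a hwpoison-aware SIGBUS handler that consumes the reported
      address via the standard Linux siginfo fields (si_code
      BUS_MCEERR_AR/BUS_MCEERR_AO, si_addr):

        /* Minimal sketch of a hwpoison-aware application; a real handler
         * would unmap or repair the affected range instead of exiting. */
        #define _GNU_SOURCE             /* for BUS_MCEERR_AR / BUS_MCEERR_AO */
        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
        {
                if (info->si_code == BUS_MCEERR_AR ||
                    info->si_code == BUS_MCEERR_AO)
                        fprintf(stderr, "memory error at %p\n", info->si_addr);
                _exit(1);
        }

        int main(void)
        {
                struct sigaction sa = {
                        .sa_sigaction = sigbus_handler,
                        .sa_flags = SA_SIGINFO,
                };

                sigaction(SIGBUS, &sa, NULL);
                pause();        /* a real application would now touch memory */
                return 0;
        }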
      
      [naoya.horiguchi@nec.com: fix walk_page_range() return]
        Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
      
      Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3f5d80e
    • mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes · 203c06ee
      Mel Gorman authored
      
      
      Dave Hansen reported the following about Feng Tang's tests on a machine
      with persistent memory onlined as a DRAM-like device.
      
        Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
        ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
        use" mode and being managed via the buddy just like the normal RAM.
      
        The PMEM zones are big ones:
      
              present  65011712 = 248 G
              high       134595 = 525 M
      
        The PMEM nodes, of course, don't have any CPUs in them.
      
        With your series, the pcp->high value per-cpu is 69584 pages or about
        270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
        worst-case memory in the pcps per zone, or roughly 10% of the size of
        the zone.
      
      This should not cause a problem as such, although it could trigger
      reclaim due to pages being stored on per-cpu lists for CPUs remote to
      a node.  It is not possible to treat cpuless nodes exactly the same
      as normal nodes, but the worst-case scenario can be mitigated by
      splitting pcp->high across all online CPUs for cpuless memory nodes.
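
      A hedged sketch of the mitigation (helper and parameter names are
      hypothetical, not the kernel's): when a zone sits on a CPU-less node,
      divide its pcp->high budget by all online CPUs instead of the zero
      CPUs local to the node.

        /* Hypothetical helper illustrating the split; not the kernel code. */
        static unsigned long pcp_high_per_cpu(unsigned long zone_high_pages,
                                              unsigned int cpus_local_to_zone,
                                              unsigned int online_cpus)
        {
                /* CPU-less (e.g. PMEM) nodes fall back to every online CPU. */
                unsigned int ncpus = cpus_local_to_zone ? cpus_local_to_zone
                                                        : online_cpus;

                return zone_high_pages / (ncpus ? ncpus : 1);
        }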
      
      Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Tang, Feng" <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      203c06ee
    • mm/page_alloc: allow high-order pages to be stored on the per-cpu lists · 44042b44
      Mel Gorman authored
      
      
      The per-cpu page allocator (PCP) only stores order-0 pages.  This means
      that all THP and "cheap" high-order allocations, including SLUB's,
      contend on the zone->lock.  This patch extends the PCP allocator to
      store THP and "cheap" high-order pages.  Note that struct per_cpu_pages
      increases in size to 256 bytes (4 cache lines) on x86-64.

      Note that this is not necessarily a universal performance win because
      of how it is implemented.  High-order pages can cause pcp->high to be
      exceeded prematurely for lower orders so, for example, a large number
      of THP pages being freed could release order-0 pages from the PCP
      lists.  Hence, much depends on the allocation/free pattern as observed
      by a single CPU to determine whether caching helps or hurts a
      particular workload.
      
      That said, basic performance testing passed.  The following is a netperf
      UDP_STREAM test which hits the relevant patches as some of the network
      allocations are high-order.
      
      netperf-udp
                                       5.13.0-rc2             5.13.0-rc2
                                 mm-pcpburst-v3r4   mm-pcphighorder-v1r7
      Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
      Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
      Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
      Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
      Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
      Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
      Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
      Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
      Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*
      
      Functionally, a patch like this is necessary to make bulk allocation of
      high-order pages work with similar performance to order-0 bulk
      allocations.  The bulk allocator is not updated in this series as it would
      have to be determined by bulk allocation users how they want to track the
      order of pages allocated with the bulk allocator.
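
      As an aside, the eligibility check can be sketched roughly as follows
      (the constants stand in for PAGE_ALLOC_COSTLY_ORDER and the PMD-sized
      THP order; an approximation, not the exact kernel predicate):

        #define COSTLY_ORDER    3       /* stand-in for PAGE_ALLOC_COSTLY_ORDER */
        #define THP_ORDER       9       /* 2 MiB THP with 4 KiB base pages */

        /* "Cheap" orders plus THP are eligible for the per-cpu lists. */
        static int pcp_allowed_order_sketch(unsigned int order)
        {
                return order <= COSTLY_ORDER || order == THP_ORDER;
        }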
      
      Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44042b44
    • mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM · 43b02ba9
      Mike Rapoport authored
      
      
      After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
      configuration option is equivalent to FLATMEM.
      
      Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b02ba9
    • mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA · a9ee6cf5
      Mike Rapoport authored
      
      
      After removal of DISCONTIGMEM, the NEED_MULTIPLE_NODES and NUMA
      configuration options are equivalent.
      
      Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
      
      Done with
      
      	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
      		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
      	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
      		$(git grep -wl NEED_MULTIPLE_NODES)
      
      with manual tweaks afterwards.
      
      [rppt@linux.ibm.com: fix arm boot crash]
        Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ee6cf5
    • docs: remove description of DISCONTIGMEM · 48d9f335
      Mike Rapoport authored
      
      
      Remove the description of DISCONTIGMEM from the "Memory Models"
      document and update the VM sysctl description so that it no longer
      mentions DISCONTIGMEM.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-8-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48d9f335
    • arch, mm: remove stale mentions of DISCONTIGMEM · d3c251ab
      Mike Rapoport authored
      
      
      There are several places that mention DISCONTIGMEM in comments or have
      stale code guarded by CONFIG_DISCONTIGMEM.
      
      Remove the dead code and update the comments.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3c251ab
    • mm: remove CONFIG_DISCONTIGMEM · bb1c50d3
      Mike Rapoport authored
      
      
      There are no architectures that support DISCONTIGMEM left.
      
      Remove the configuration option and the dead code it was guarding in the
      generic memory management code.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1c50d3
    • m68k: remove support for DISCONTIGMEM · 5ab06e10
      Mike Rapoport authored
      
      
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ab06e10
    • arc: remove support for DISCONTIGMEM · 8b793b44
      Mike Rapoport authored
      
      
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b793b44
    • arc: update comment about HIGHMEM implementation · e7793e53
      Mike Rapoport authored
      
      
      Arc does not use DISCONTIGMEM to implement high memory; update the comment
      describing how high memory works to reflect this.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7793e53
    • alpha: remove DISCONTIGMEM and NUMA · fdb7d9b7
      Mike Rapoport authored
      
      
      Patch series "Remove DISCONTIGMEM memory model", v3.
      
      The SPARSEMEM memory model was supposed to entirely replace
      DISCONTIGMEM a (long) while ago.  The last architectures that used
      DISCONTIGMEM were updated to use other memory models in v5.11, and it
      is about time to remove DISCONTIGMEM from the kernel entirely.
      
      This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
      model selection in mm/Kconfig and replaces usage of redundant
      CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
      and CONFIG_FLATMEM respectively.
      
      I've also removed NUMA support on alpha, which had been BROKEN for
      more than 15 years.
      
      There were also minor updates all over arch/ to remove mentions of
      DISCONTIGMEM in comments and #ifdefs.
      
      This patch (of 9):
      
      NUMA has been marked broken on alpha for more than 15 years, and
      DISCONTIGMEM was replaced with SPARSEMEM in v5.11.
      
      Remove both NUMA and DISCONTIGMEM support from alpha.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210608091316.3622-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdb7d9b7
    • mm/page_alloc: move free_the_page · 21d02f8f
      Mel Gorman authored
      
      
      Patch series "Allow high order pages to be stored on PCP", v2.
      
      The per-cpu page allocator (PCP) only handles order-0 pages.  With the
      series "Use local_lock for pcp protection and reduce stat overhead" and
      "Calculate pcp->high based on zone sizes and active CPUs", it's now
      feasible to store high-order pages on PCP lists.
      
      This small series allows PCP to store "cheap" orders where cheap is
      determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.
      
      This patch (of 2):
      
      In the next patch, free_compound_page() is going to use the common
      helper free_the_page().  This patch moves the definition to ease
      review.  No functional change.
      
      Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21d02f8f
    • mm/page_alloc: fix counting of managed_pages · f7ec1044
      Liu Shixin authored
      Commit f6366156 ("mm/page_alloc.c: clear out zone->lowmem_reserve[]
      if the zone is empty") clears out zone->lowmem_reserve[] if the zone
      is empty.  But when the zone is not empty and
      sysctl_lowmem_reserve_ratio[i] is set to zero,
      zone_managed_pages(zone) is not counted in the managed_pages either.
      This is inconsistent with the description of lowmem_reserve, so fix
      it.

      Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com
      Fixes: f6366156 ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reported-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ec1044
    • mm/page_alloc: improve memmap_pages dbg msg · e47aa905
      Dong Aisheng authored
      
      
      Make debug message more accurate.
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-6-aisheng.dong@nxp.com
      Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e47aa905
    • mm: drop SECTION_SHIFT in code comments · 777c00f5
      Dong Aisheng authored
      Actually SECTIONS_SHIFT is used in the kernel code, so the code
      comment is strictly incorrect.  And since commit bbeae5b0 ("mm: move
      page flags layout to separate header"), the SECTIONS_SHIFT definition
      has been moved to include/linux/page-flags-layout.h.  Since the code
      itself looks quite straightforward, instead of moving the code comment
      into the new place as well, we simply remove it.
      
      This also fixed a checkpatch complain derived from the original code:
      WARNING: please, no space before tabs
      + * SECTIONS_SHIFT    ^I^I#bits space required to store a section #$
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-2-aisheng.dong@nxp.com
      Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
      Suggested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c00f5
    • mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822
      Mel Gorman authored
      
      
      This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
      similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
      both pcp->batch and pcp->high with the higher pcp->high potentially
      reducing zone->lock contention.  However, the higher pcp->batch value also
      potentially increased allocation latency while the PCP was refilled.  This
      sysctl only adjusts pcp->high so that zone->lock contention is potentially
      reduced but allocation latency during a PCP refill remains the same.
      
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  35071
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=64
                    high:  4383
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=0
                    high:  649
                    batch: 63
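
      Roughly, and only as a hedged sketch (helper name and exact rounding
      are assumptions, not the kernel code), a non-zero fraction derives
      pcp->high from the zone size divided by the fraction, split across the
      CPUs local to the zone, while zero restores the default
      watermark-based value:

        /* Hypothetical sketch of how the sysctl could map to pcp->high. */
        static unsigned long pcp_high_from_fraction(unsigned long zone_managed_pages,
                                                    unsigned long default_high,
                                                    unsigned int fraction,
                                                    unsigned int local_cpus)
        {
                if (!fraction)          /* 0 means "use the default heuristic" */
                        return default_high;

                return (zone_managed_pages / fraction) /
                       (local_cpus ? local_cpus : 1);
        }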
      
      [mgorman@techsingularity.net: fix documentation]
        Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74f44822
    • mm/page_alloc: limit the number of pages on PCP lists when reclaim is active · c49c2c47
      Mel Gorman authored
      
      
      When kswapd is active then direct reclaim is potentially active.  In
      either case, it is possible that a zone would be balanced if pages were
      not trapped on PCP lists.  Instead of draining remote pages, simply limit
      the size of the PCP lists while kswapd is active.
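
      A hedged sketch of the cap (names and the exact factor are
      assumptions): while reclaim is active on the node, clamp the effective
      high mark to a few batches so remote per-cpu lists cannot hold a zone
      below its watermarks.

        /* Hypothetical helper; reclaim_active reflects kswapd/direct
         * reclaim being active on the zone's node. */
        static unsigned int pcp_high_reclaim_capped(unsigned int high,
                                                    unsigned int batch,
                                                    int reclaim_active)
        {
                unsigned int capped = batch << 2;   /* a few batches worth */

                return (reclaim_active && capped < high) ? capped : high;
        }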
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49c2c47
    • mm/page_alloc: scale the number of pages that are batch freed · 3b12e7e9
      Mel Gorman authored
      
      
      When a task is freeing a large number of order-0 pages, it may acquire
      the zone->lock multiple times, freeing pages in batches.  This may
      unnecessarily contend on the zone lock when freeing a very large
      number of pages.  This patch adapts the size of the batch based on the
      recent pattern to scale the batch size for subsequent frees.

      As the machines I used to test this are not large enough to illustrate
      a problem, a debugging patch shows patterns like the following
      (slightly edited for clarity):
      
      Baseline vanilla kernel
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
      
      With patches
        time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
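
      The scaling itself can be pictured with a hedged sketch (field names
      hypothetical): each time large frees arrive in quick succession, a
      per-cpu free factor grows and the batch drained under one zone->lock
      acquisition doubles, capped by pcp->high.

        /* Hypothetical sketch of the batch scaling, capped at the high mark. */
        static unsigned int nr_pcp_free_sketch(unsigned int batch,
                                               unsigned int free_factor,
                                               unsigned int high)
        {
                unsigned int count = batch << free_factor;

                return count < high ? count : high;
        }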
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b12e7e9
    • mm/page_alloc: adjust pcp->high after CPU hotplug events · 04f8cfea
      Mel Gorman authored
      
      
      The PCP high watermark is based on the number of online CPUs so the
      watermarks must be adjusted during CPU hotplug.  At the time of
      hot-remove, the number of online CPUs is already adjusted but during
      hot-add, a delta needs to be applied to update PCP to the correct value.
      After this patch is applied, the high watermarks are adjusted correctly.
      
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
        # echo 0 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  664
        # echo 1 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f8cfea
    • mm/page_alloc: disassociate the pcp->high from pcp->batch · b92ca18e
      Mel Gorman authored
      
      
      The pcp high watermark is based on the batch size, but there is no
      relationship between them other than that it is convenient to use
      early in boot.
      
      This patch takes the first step and bases pcp->high on the zone low
      watermark split across the number of CPUs local to a zone while the batch
      size remains the same to avoid increasing allocation latencies.  The
      intent behind the default pcp->high is "set the number of PCP pages such
      that if they are all full that background reclaim is not started
      prematurely".
      
      Note that in this patch the pcp->high values are adjusted after memory
      hotplug events, min_free_kbytes adjustments and watermark scale factor
      adjustments but not CPU hotplug events which is handled later in the
      series.
      
      On a test KVM instance;
      
      Before grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  378
                    batch: 63
      
      After grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
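
      The default can be sketched as follows (a simplification with
      hypothetical names): divide the zone's low watermark among the CPUs
      local to the zone, so that completely full PCP lists alone cannot push
      the zone below the point where background reclaim starts.

        /* Hypothetical sketch: pcp->high derived from the low watermark. */
        static unsigned long pcp_high_from_wmark(unsigned long low_wmark_pages,
                                                 unsigned int local_cpus)
        {
                return low_wmark_pages / (local_cpus ? local_cpus : 1);
        }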
      
      [mgorman@techsingularity.net:  fix __setup_per_zone_wmarks for parallel memory
      hotplug]
        Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b92ca18e
    • mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35
      Mel Gorman authored
      
      
      Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
      
      The per-cpu page allocator (PCP) is meant to reduce contention on the
      zone lock but the sizing of batch and high is archaic and takes
      neither the zone size nor the number of CPUs local to a zone into
      account.  With larger zones and more CPUs per node, the contention is
      getting worse.  Furthermore, the fact that vm.percpu_pagelist_fraction
      adjusts both batch and high values means that the sysctl can reduce
      zone lock contention but also increase allocation latencies.
      
      This series disassociates pcp->high from pcp->batch and then scales
      pcp->high based on the size of the local zone with limited impact to
      reclaim and accounting for active CPUs but leaves pcp->batch static.  It
      also adapts the number of pages that can be on the pcp list based on
      recent freeing patterns.
      
      The motivation is partially to adjust to larger memory sizes but is
      also driven by the fact that large batches of page freeing via
      release_pages() often show zone contention as a major part of the
      problem.  Another is a bug report based on an older kernel where a
      multi-terabyte process can take several minutes to exit.  A workaround
      was to use vm.percpu_pagelist_fraction to increase the pcp->high value
      but testing indicated that a production workload could not use the
      same values because of an increase in allocation latencies.
      Unfortunately, I cannot reproduce this test case myself as the
      multi-terabyte machines are in active use, but this series should
      alleviate the problem.

      The series aims to address both and partially acts as a prerequisite.
      The PCP only works with order-0 pages, which is useless for SLUB (when
      using high orders) and THP (unconditionally).  To store high-order
      pages on PCP, the pcp->high values need to be increased first.
      
      This patch (of 6):
      
      The vm.percpu_pagelist_fraction is used to increase the batch and high
      limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
      is to reduce zone lock acquisition when allocating/freeing pages but it
      has a problem.  While it can decrease contention, it can also increase
      latency on the allocation side due to unreasonably large batch sizes.
      This leads to games where an administrator adjusts
      percpu_pagelist_fraction on the fly to work around contention and
      allocation latency problems.
      
      This series aims to alleviate the problems with zone lock contention while
      avoiding the allocation-side latency problems.  For the purposes of
      review, it's easier to remove this sysctl now and reintroduce a similar
      sysctl later in the series that deals only with pcp->high.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbbecb35
    • mm: page_alloc: dump migrate-failed pages only at -EBUSY · 151e084a
      Minchan Kim authored
      
      
      alloc_contig_dump_pages() aims to help debug page migration failures
      caused by a page refcount elevated compared to expected_count (for the
      details, please look at migrate_page_move_mapping).

      However, -ENOMEM just means that the system is under memory pressure,
      which is not related to the page refcount at all.  Thus, dumping the
      page list is not helpful from a debugging point of view.
      
      Link: https://lkml.kernel.org/r/YKa2Wyo9xqIErpfa@google.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      151e084a
    • mm/page_alloc: update PGFREE outside the zone lock in __free_pages_ok · 90249993
      Mel Gorman authored
      
      
      VM events do not need explicit protection by disabling IRQs so update the
      counter with IRQs enabled in __free_pages_ok.
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-10-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90249993
    • mm/page_alloc: avoid conflating IRQs disabled with zone->lock · df1acc85
      Mel Gorman authored
      
      
      Historically when freeing pages, free_one_page() assumed that callers had
      IRQs disabled and the zone->lock could be acquired with spin_lock().  This
      confuses the scope of what local_lock_irq is protecting and what
      zone->lock is protecting in free_unref_page_list in particular.
      
      This patch uses spin_lock_irqsave() for the zone->lock in free_one_page()
      instead of relying on callers to have disabled IRQs.
      free_unref_page_commit() is changed to only deal with PCP pages protected
      by the local lock.  free_unref_page_list() then first frees isolated pages
      to the buddy lists with free_one_page() and frees the rest of the pages to
      the PCP via free_unref_page_commit().  The end result is that
      free_one_page() is no longer depending on side-effects of local_lock to be
      correct.
      
      Note that this may incur a performance penalty while memory hot-remove is
      running but that is not a common operation.
      
      [lkp@intel.com: Ensure CMA pages get added to the correct pcp list]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df1acc85
    • mm/page_alloc: explicitly acquire the zone lock in __free_pages_ok · 56f0e661
      Mel Gorman authored
      
      
      __free_pages_ok() disables IRQs before calling a common helper
      free_one_page() that acquires the zone lock.  This is not safe
      according to Documentation/locking/locktypes.rst and, in this context,
      IRQ disabling is not protecting a per_cpu_pages structure either (or a
      local_lock would be used).
      
      This patch explicitly acquires the lock with spin_lock_irqsave instead of
      relying on a helper.  This removes the last instance of local_irq_save()
      in page_alloc.c.
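
      A hedged sketch of the resulting pattern (simplified;
      __free_one_page_sketch is a hypothetical stand-in for the real buddy
      freeing helper):

        /* Take the zone lock and disable IRQs in one explicit step. */
        static void free_pages_to_buddy_sketch(struct zone *zone,
                                               struct page *page,
                                               unsigned int order)
        {
                unsigned long flags;

                spin_lock_irqsave(&zone->lock, flags);
                __free_one_page_sketch(page, zone, order);  /* hypothetical */
                spin_unlock_irqrestore(&zone->lock, flags);
        }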
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56f0e661
    • mm/page_alloc: reduce duration that IRQs are disabled for VM counters · 43c95bcc
      Mel Gorman authored
      
      
      IRQs are left disabled for the zone and node VM event counters.  This
      is unnecessary as the affected counters are allowed to race for
      preemption and IRQs.
      
      This patch reduces the scope of IRQs being disabled via
      local_[lock|unlock]_irq on !PREEMPT_RT kernels.  One
      __mod_zone_freepage_state is still called with IRQs disabled.  While this
      could be moved out, it's not free on all architectures as some require
      IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43c95bcc
    • mm/page_alloc: batch the accounting updates in the bulk allocator · 3e23060b
      Mel Gorman authored
      
      
      Now that the zone_statistics are simple counters that do not require
      special protection, the bulk allocator accounting updates can be batch
      updated without adding too much complexity with protected RMW updates or
      using xchg.
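
      A hedged sketch of the batching (simplified, with a hypothetical
      nr_account accumulated in the allocation loop): the counters are
      bumped once for the whole request rather than once per page.

        /* Hypothetical sketch: one accounting update for a whole bulk of
         * order-0 pages instead of one update per allocated page. */
        static void bulk_account_sketch(struct zone *zone,
                                        unsigned long nr_account)
        {
                __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
                zone_statistics(zone, zone, nr_account);
        }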
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e23060b
    • mm/vmstat: inline NUMA event counter updates · 3ac44a34
      Mel Gorman authored
      
      
      __count_numa_event is small enough to be treated similarly to
      __count_vm_event so inline it.
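
      For illustration, a hedged sketch of what such an inlined per-CPU
      event bump looks like (types simplified; not the exact kernel helper):

        /* Per-CPU event increment, cheap enough to be a static inline. */
        static inline void count_numa_event_sketch(struct per_cpu_zonestat __percpu *pzstats,
                                                   int item)
        {
                this_cpu_inc(pzstats->vm_numa_event[item]);
        }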
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ac44a34
    • mm/vmstat: convert NUMA statistics to basic NUMA counters · f19298b9
      Mel Gorman authored
      
      
      NUMA statistics are maintained on the zone level for hits, misses,
      foreign allocations etc., but nothing relies on them being perfectly
      accurate for functional correctness.  The counters are used by
      userspace to get a general overview of a workload's NUMA behaviour,
      but the page allocator incurs a high cost to maintain perfect accuracy
      similar to what is required for a vmstat like NR_FREE_PAGES.  There is
      even a sysctl vm.numa_stat to allow userspace to turn off the
      collection of NUMA statistics like NUMA_HIT.
      
      This patch converts NUMA_HIT and friends to be NUMA events with similar
      accuracy to VM events.  There is a possibility that slight errors will be
      introduced but the overall trend as seen by userspace will be similar.
      The counters are no longer updated from vmstat_refresh context as it is
      unnecessary overhead for counters that may never be read by userspace.
      Note that counters could be maintained at the node level to save space but
      it would have a user-visible impact due to /proc/zoneinfo.
      
      [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19298b9
    • mm/page_alloc: convert per-cpu list protection to local_lock · dbbee9d5
      Mel Gorman authored
      
      
      There is a lack of clarity about what exactly
      local_irq_save/local_irq_restore protects in page_alloc.c.  It
      conflates the protection of per-cpu page allocation structures with
      per-cpu vmstat deltas.
      
      This patch protects the PCP structure using local_lock which for most
      configurations is identical to IRQ enabling/disabling.  The scope of the
      lock is still wider than it should be but this is decreased later.
      
      It is possible for the local_lock to be embedded safely within struct
      per_cpu_pages but it adds complexity to free_unref_page_list.
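
      A hedged sketch of the local_lock pattern this converts to (structure
      and field names simplified; on !PREEMPT_RT the _irqsave variant maps
      to IRQ disabling, on PREEMPT_RT to a per-CPU spinlock that only
      disables migration):

        /* Simplified sketch of protecting per-CPU page lists with local_lock. */
        struct pagesets_sketch {
                local_lock_t lock;
        };
        static DEFINE_PER_CPU(struct pagesets_sketch, pagesets_sketch) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        static void pcp_operation_sketch(void)
        {
                unsigned long flags;

                local_lock_irqsave(&pagesets_sketch.lock, flags);
                /* ... touch this CPU's per-cpu page lists ... */
                local_unlock_irqrestore(&pagesets_sketch.lock, flags);
        }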
      
      [akpm@linux-foundation.org: coding style fixes]
      [mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
        Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
      [lkp@intel.com: Make pagesets static]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbbee9d5
    • mm/page_alloc: split per cpu page lists and zone stats · 28f836b6
      Mel Gorman authored
      
      
      The PCP (per-cpu page allocator in page_alloc.c) shares locking
      requirements with vmstat and the zone lock which is inconvenient and
      causes some issues.  For example, the PCP list and vmstat share the same
      per-cpu space meaning that it's possible that vmstat updates dirty cache
      lines holding per-cpu lists across CPUs unless padding is used.  Second,
      PREEMPT_RT does not want to disable IRQs for too long in the page
      allocator.
      
      This series splits the locking requirements and uses locks types more
      suitable for PREEMPT_RT, reduces the time when special locking is required
      for stats and reduces the time when IRQs need to be disabled on
      !PREEMPT_RT kernels.
      
      Why local_lock?  PREEMPT_RT considers the following sequence to be unsafe
      as documented in Documentation/locking/locktypes.rst
      
         local_irq_disable();
         spin_lock(&lock);
      
      The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
      -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock).  While it's possible
      to separate this out, it generally means there are points where we
      enable IRQs and reenable them again immediately.  To prevent a
      migration and the per-cpu pointer going stale, migrate_disable is also
      needed.  That is a custom lock that is similar to, but worse than,
      local_lock.  Furthermore, on PREEMPT_RT, it's undesirable to leave
      IRQs disabled for too long.  By converting to local_lock, which
      disables migration on PREEMPT_RT, the locking requirements can be
      separated and the protections for the PCP, stats and the zone lock can
      start moving to PREEMPT_RT-safe equivalents.  As a bonus, local_lock
      also means that PROVE_LOCKING does something useful.
      
      After that, it's obvious that zone_statistics incurs too much overhead and
      leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
      zone_statistics uses perfectly accurate counters requiring IRQs be
      disabled for parallel RMW sequences when inaccurate ones like vm_events
      would do.  The series makes the NUMA statistics (NUMA_HIT and friends)
      inaccurate counters that then require no special protection on
      !PREEMPT_RT.
      
      The bulk page allocator can then do stat updates in bulk with IRQs enabled
      which should improve the efficiency.  Technically, this could have been
      done without the local_lock and vmstat conversion work and the order
      simply reflects the timing of when different series were implemented.
      
      Finally, there are places where we conflate IRQs being disabled for the
      PCP with the IRQ-safe zone spinlock.  The remainder of the series reduces
      the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
      By the end of the series, page_alloc.c does not call local_irq_save so the
      locking scope is a bit clearer.  The one exception is that modifying
      NR_FREE_PAGES still happens in places where it's known the IRQs are
      disabled as it's harmless for PREEMPT_RT and would be expensive to split
      the locking there.
      
      No performance data is included because, despite the overhead of the
      stats, it's within the noise for most workloads on !PREEMPT_RT.
      However, Jesper Dangaard Brouer ran a page allocation microbenchmark
      on an E5-1650 v4 @ 3.60GHz CPU on the first version of this series.
      Focusing on the array variant of the bulk page allocator reveals the
      following.
      
      (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
      ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
      
               Baseline        Patched
       1       56.383          54.225 (+3.83%)
       2       40.047          35.492 (+11.38%)
       3       37.339          32.643 (+12.58%)
       4       35.578          30.992 (+12.89%)
       8       33.592          29.606 (+11.87%)
       16      32.362          28.532 (+11.85%)
       32      31.476          27.728 (+11.91%)
       64      30.633          27.252 (+11.04%)
       128     30.596          27.090 (+11.46%)
      
      While this is a positive outcome, the series is more likely to be
      interesting to the RT people in terms of getting parts of the PREEMPT_RT
      tree into mainline.
      
      This patch (of 9):
      
      The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
      in the same struct per_cpu_pages even though vmstats have no direct impact
      on the per-cpu page lists.  This is inconsistent because the vmstats for a
      node are stored on a dedicated structure.  The bigger issue is that the
      per_cpu_pages structure is not cache-aligned and stat updates either cache
      conflict with adjacent per-cpu lists incurring a runtime cost or padding
      is required incurring a memory cost.
      
      This patch splits the per-cpu pagelists and the vmstat deltas into
      separate structures.  It's mostly a mechanical conversion but some
      variable renaming is done to clearly distinguish the per-cpu pages
      structure (pcp) from the vmstats (pzstats).
      
      Superficially, this appears to increase the size of the per_cpu_pages
      structure but the movement of expire fills a structure hole so there is no
      impact overall.
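
      A heavily trimmed sketch of the resulting split (field lists are
      illustrative, not the full kernel definitions):

        /* Per-CPU page lists: hot, touched on every PCP alloc/free. */
        struct per_cpu_pages_sketch {
                int count;              /* pages currently on the lists */
                int high;               /* high watermark, emptying needed */
                int batch;              /* chunk size for buddy add/remove */
                struct list_head lists[MIGRATE_PCPTYPES];
        };

        /* Per-CPU zone stat deltas: updated on entirely different paths. */
        struct per_cpu_zonestat_sketch {
                s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
                s8 stat_threshold;
        };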
      
      [mgorman@techsingularity.net: make it W=1 cleaner]
        Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
      [mgorman@techsingularity.net: make it W=1 even cleaner]
        Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
      [lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
      [vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28f836b6
    • kbuild: skip per-CPU BTF generation for pahole v1.18-v1.21 · a0b8200d
      Andrii Nakryiko authored
      
      
      Commit "mm/page_alloc: convert per-cpu list protection to local_lock" will
      introduce a zero-sized per-CPU variable, which causes pahole to generate
      invalid BTF.  Only pahole versions 1.18 through 1.21 are impacted, as
      before 1.18 pahole doesn't know anything about per-CPU variables, and 1.22
      contains the proper fix for the issue.
      
      Luckily, pahole 1.18 gained the --skip_encoding_btf_vars option,
      disabling BTF generation for per-CPU variables in anticipation of some
      unanticipated problems.  So use this escape hatch to disable per-CPU
      var BTF info on those problematic pahole versions.  Users relying on
      the availability of per-CPU var BTFs would need to upgrade to pahole
      1.22+, but everyone else won't notice any regressions.
      
      Link: https://lkml.kernel.org/r/20210530002536.3193829-1-andrii@kernel.org
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Michal Suchanek <msuchanek@suse.de>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b8200d
    • mm/page_alloc: switch to pr_debug · 9660ecaa
      Heiner Kallweit authored
      
      
      Having such debug messages in the dmesg log may confuse users.  Therefore
      restrict debug output to cases where DEBUG is defined or dynamic debugging
      is enabled for the respective code piece.
      
      Link: https://lkml.kernel.org/r/976adb93-3041-ce63-48fc-55a6096a51c1@gmail.com
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9660ecaa
    • mm: optimise nth_page for contiguous memmap · 1cfcee72
      Matthew Wilcox (Oracle) authored
      
      
      If the memmap is virtually contiguous (either because we're using a
      virtually mapped memmap or because we don't support a discontig memmap at
      all), then we can implement nth_page() by simple addition.  Contrary to
      popular belief, the compiler is not able to optimise this itself for a
      vmemmap configuration.  This reduces one example user (sg.c) by four
      instructions:
      
              struct page *page = nth_page(rsv_schp->pages[k], offset >> PAGE_SHIFT);
      
      before:
         49 8b 45 70             mov    0x70(%r13),%rax
         48 63 c9                movslq %ecx,%rcx
         48 c1 eb 0c             shr    $0xc,%rbx
         48 8b 04 c8             mov    (%rax,%rcx,8),%rax
         48 2b 05 00 00 00 00    sub    0x0(%rip),%rax
                 R_X86_64_PC32      vmemmap_base-0x4
         48 c1 f8 06             sar    $0x6,%rax
         48 01 d8                add    %rbx,%rax
         48 c1 e0 06             shl    $0x6,%rax
         48 03 05 00 00 00 00    add    0x0(%rip),%rax
                 R_X86_64_PC32      vmemmap_base-0x4
      
      after:
         49 8b 45 70             mov    0x70(%r13),%rax
         48 63 c9                movslq %ecx,%rcx
         48 c1 eb 0c             shr    $0xc,%rbx
         48 c1 e3 06             shl    $0x6,%rbx
         48 03 1c c8             add    (%rax,%rcx,8),%rbx
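
      The change itself can be summarised with a hedged sketch (conceptually
      what the patch does; the actual guards in the header may be spelled
      slightly differently):

        /* Sparse memmap without vmemmap: the nth page may live in a
         * different memmap chunk, so go via the pfn. */
        #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
        #define nth_page(page, n)  pfn_to_page(page_to_pfn((page)) + (n))
        #else
        /* Virtually contiguous memmap: plain pointer arithmetic suffices. */
        #define nth_page(page, n)  ((page) + (n))
        #endif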
      
      Link: https://lkml.kernel.org/r/20210413194625.1472345-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Douglas Gilbert <dougg@torque.net>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cfcee72
    • mm: constify page_count and page_ref_count · 5f7dadf3
      Matthew Wilcox (Oracle) authored
      
      
      Now that compound_head() accepts a const struct page pointer, these two
      functions can be marked as not modifying the page pointer they are passed.
      
      Link: https://lkml.kernel.org/r/20210416231531.2521383-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7dadf3
    • mm: constify get_pfnblock_flags_mask and get_pfnblock_migratetype · ca891f41
      Matthew Wilcox (Oracle) authored
      
      
      The struct page is not modified by these routines, so it can be marked
      const.
      
      Link: https://lkml.kernel.org/r/20210416231531.2521383-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca891f41
    • mm: make compound_head const-preserving · 0f2317e3
      Matthew Wilcox (Oracle) authored
      
      
      If you pass a const pointer to compound_head(), you get a const pointer
      back; if you pass a mutable pointer, you get a mutable pointer back.  Also
      remove an unnecessary forward definition of struct page; we're about to
      dereference page->compound_head, so it must already have been defined.
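
      A hedged sketch of the const-preserving trick (simplified, not the
      exact kernel macro): implementing the helper as a macro lets typeof()
      carry the caller's constness through, which a static inline returning
      struct page * could not do.

        /* Simplified sketch of a const-preserving compound_head(). */
        #define compound_head_sketch(page)                                      \
        ({                                                                      \
                unsigned long _head = READ_ONCE((page)->compound_head);         \
                (typeof(page))(_head & 1 ? _head - 1 : (unsigned long)(page));  \
        })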
      
      Link: https://lkml.kernel.org/r/20210416231531.2521383-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f2317e3
    • mm/page_owner: constify dump_page_owner · 8bf6f451
      Matthew Wilcox (Oracle) authored
      
      
      dump_page_owner() only uses struct page to find the page_ext, and
      lookup_page_ext() already takes a const argument.
      
      Link: https://lkml.kernel.org/r/20210416231531.2521383-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bf6f451
    • mm/debug: factor PagePoisoned out of __dump_page · be7c701f
      Matthew Wilcox (Oracle) authored
      
      
      Move the PagePoisoned test into dump_page().  Skip the hex print for
      poisoned pages -- we know they're full of ffffffff.  Move the reason
      printing from __dump_page() to dump_page().
      
      Link: https://lkml.kernel.org/r/20210416231531.2521383-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be7c701f