  1. Mar 29, 2023
    • selftests/mm: fix split huge page tests · dd63bd7d
      Zi Yan authored
      Fix two inputs to check_anon_huge() and one if condition, so the tests
      work as expected.
      
      Link: https://lkml.kernel.org/r/20230306160907.16804-1-zi.yan@sent.com
      Fixes: c07c343c ("selftests/vm: dedup THP helpers")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Tested-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() · 99c29133
      Gerald Schaefer authored
      
      
      s390 can do more fine-grained handling of spurious TLB protection faults
      when the PTE pointer is also available.
      
      Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an
      additional parameter.
      
      This will add no functional change to other architectures, but those with
      private flush_tlb_fix_spurious_fault() implementations need to be made
      aware of the new parameter.
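
      A minimal sketch of the shape of the change (stand-in types and a stub
      flush_tlb_page(); this is not the kernel diff itself): the generic
      fallback accepts the new pte_t pointer but ignores it, so architectures
      without a private implementation behave exactly as before.

          #include <stdio.h>

          struct vm_area_struct { int dummy; };          /* stub type */
          typedef struct { unsigned long val; } pte_t;   /* stub type */

          static void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
          {
                  (void)vma;
                  printf("flush_tlb_page(addr=%#lx)\n", addr);   /* stub */
          }

          /* Generic fallback: the new @ptep argument is accepted but unused. */
          #define flush_tlb_fix_spurious_fault(vma, address, ptep) \
                  do { (void)(ptep); flush_tlb_page(vma, address); } while (0)

          int main(void)
          {
                  struct vm_area_struct vma;
                  pte_t pte = { 0 };

                  flush_tlb_fix_spurious_fault(&vma, 0x1000UL, &pte);
                  return 0;
          }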
      
      Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: show per fullness group class stats · e1807d5d
      Sergey Senozhatsky authored
      
      
      We keep the old fullness (3/4 threshold) reporting in
      zs_stats_size_show().  Switch from almost full/empty stats to
      fine-grained per inuse ratio (fullness group) reporting, which gives
      significantly more data on class fragmentation.
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-5-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: rework compaction algorithm · 5a845e9f
      Sergey Senozhatsky authored
      
      
      The zsmalloc compaction algorithm has the potential to waste some CPU
      cycles, particularly when compacting pages within the same fullness group.
      This is due to the way it selects the head page of the fullness list for
      source and destination pages, and how it reinserts those pages during each
      iteration.  The algorithm may first use a page as a migration destination
      and then as a migration source, leading to an unnecessary back-and-forth
      movement of objects.
      
      Consider the following fullness list:
      
      PageA PageB PageC PageD PageE
      
      During the first iteration, the compaction algorithm will select PageA as
      the source and PageB as the destination.  All of PageA's objects will be
      moved to PageB, and then PageA will be released while PageB is reinserted
      into the fullness list.
      
      PageB PageC PageD PageE
      
      During the next iteration, the compaction algorithm will again select the
      head of the list as the source and destination, meaning that PageB will
      now serve as the source and PageC as the destination.  This will result in
      the objects being moved away from PageB, the same objects that were just
      moved to PageB in the previous iteration.
      
      To prevent this avalanche effect, the compaction algorithm should not
      reinsert the destination page between iterations.  By doing so, the most
      optimal page will continue to be used and its usage ratio will increase,
      reducing internal fragmentation.  The destination page should only be
      reinserted into the fullness list if:
      - It becomes full
      - No source page is available.
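
      For illustration, here is a toy userspace model of the "sticky
      destination" idea (the names and the CAP constant are invented for this
      sketch; this is not zsmalloc code):

          #include <stdio.h>

          #define CAP 10   /* objects a toy "zspage" can hold */

          int main(void)
          {
                  int pages[] = { 3, 4, 4, 5, 6 };   /* partially used pages */
                  int n = sizeof(pages) / sizeof(pages[0]);
                  int moves = 0;

                  /* destination: the fullest non-full page, kept until full */
                  int dst = -1;
                  for (int i = 0; i < n; i++)
                          if (pages[i] < CAP && (dst < 0 || pages[i] > pages[dst]))
                                  dst = i;

                  while (dst >= 0) {
                          /* source: the emptiest page that still has objects */
                          int src = -1;
                          for (int i = 0; i < n; i++)
                                  if (i != dst && pages[i] > 0 &&
                                      (src < 0 || pages[i] < pages[src]))
                                          src = i;
                          if (src < 0)
                                  break;

                          pages[src]--;   /* one object migration, i.e. one   */
                          pages[dst]++;   /* memcpy() in zs_object_copy()     */
                          moves++;

                          if (pages[dst] == CAP) {   /* only now pick a new dst */
                                  dst = -1;
                                  for (int i = 0; i < n; i++)
                                          if (pages[i] > 0 && pages[i] < CAP &&
                                              (dst < 0 || pages[i] > pages[dst]))
                                                  dst = i;
                          }
                  }

                  for (int i = 0; i < n; i++)
                          printf("page[%d] inuse=%d\n", i, pages[i]);
                  printf("object moves: %d\n", moves);
                  return 0;
          }

      Because the destination never goes back through the list head between
      iterations, its objects are never re-migrated in the next step.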
      
      TEST
      ====
      
      It's very challenging to reliably test this series.  I ended up developing
      my own synthetic test that has 100% reproducibility.  The test generates
      significant fragmentation (for each size class) and then performs
      compaction for each class individually and tracks the number of memcpy()
      in zs_object_copy(), so that we can compare the amount of work compaction
      does on a per-class basis.
      
      Total amount of work (zram mm_stat objs_moved)
      ----------------------------------------------
      
      Old fullness grouping, old compaction algorithm:
      323977 memcpy() in zs_object_copy().
      
      Old fullness grouping, new compaction algorithm:
      262944 memcpy() in zs_object_copy().
      
      New fullness grouping, new compaction algorithm:
      213978 memcpy() in zs_object_copy().
      
      Per-class compaction memcpy() comparison (T-test)
      -------------------------------------------------
      
      x Old fullness grouping, old compaction algorithm
      + Old fullness grouping, new compaction algorithm
      
          N           Min           Max        Median           Avg        Stddev
      x 140           349          3513          2461     2314.1214     806.03271
      + 140           289          2778          2006     1878.1714     641.02073
      Difference at 95.0% confidence
              -435.95 +/- 170.595
              -18.8387% +/- 7.37193%
              (Student's t, pooled s = 728.216)
      
      x Old fullness grouping, old compaction algorithm
      + New fullness grouping, new compaction algorithm
      
          N           Min           Max        Median           Avg        Stddev
      x 140           349          3513          2461     2314.1214     806.03271
      + 140           226          2279          1644     1528.4143     524.85268
      Difference at 95.0% confidence
              -785.707 +/- 159.331
              -33.9527% +/- 6.88516%
              (Student's t, pooled s = 680.132)
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: fine-grained inuse ratio based fullness grouping · 4c7ac972
      Sergey Senozhatsky authored
      
      
      Each zspage maintains ->inuse counter which keeps track of the number of
      objects stored in the zspage.  The ->inuse counter also determines the
      zspage's "fullness group" which is calculated as the ratio of the "inuse"
      objects to the total number of objects the zspage can hold
      (objs_per_zspage).  The closer the ->inuse counter is to objs_per_zspage,
      the better.
      
      Each size class maintains several fullness lists, that keep track of
      zspages of particular "fullness".  Pages within each fullness list are
      stored in random order with regard to the ->inuse counter.  This is
      because sorting the zspages by ->inuse counter each time obj_malloc() or
      obj_free() is called would be too expensive.  However, the ->inuse counter
      is still a crucial factor in many situations.
      
      For the two major zsmalloc operations, zs_malloc() and zs_compact(), we
      typically select the head zspage from the corresponding fullness list as
      the best candidate zspage.  However, this assumption is not always
      accurate.
      
      For the zs_malloc() operation, the optimal candidate zspage should have
      the highest ->inuse counter.  This is because the goal is to maximize the
      number of ZS_FULL zspages and make full use of all allocated memory.
      
      For the zs_compact() operation, the optimal source zspage should have the
      lowest ->inuse counter.  This is because compaction needs to move objects
      in use to another page before it can release the zspage and return its
      physical pages to the buddy allocator.  The fewer objects in use, the
      quicker compaction can release the zspage.  Additionally, compaction is
      measured by the number of pages it releases.
      
      This patch reworks the fullness grouping mechanism.  Instead of having two
      groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage
      ratio above 3/4) - that result in too many zspages being included in the
      ALMOST_EMPTY group for specific classes, size classes maintain a larger
      number of fullness lists that give strict guarantees on the minimum and
      maximum ->inuse values within each group.  Each group represents a 10%
      change in the ->inuse ratio compared to neighboring groups.  In essence,
      there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up
      to 100%.
      
      This enhances the selection of candidate zspages for both zs_malloc() and
      zs_compact().  A printout of the ->inuse counters of the first 7 zspages
      per (random) class fullness group:
      
       class-768 objs_per_zspage 16:
         fullness 100%:  empty
         fullness  99%:  empty
         fullness  90%:  empty
         fullness  80%:  empty
         fullness  70%:  empty
         fullness  60%:  8  8  9  9  8  8  8
         fullness  50%:  empty
         fullness  40%:  5  5  6  5  5  5  5
         fullness  30%:  4  4  4  4  4  4  4
         fullness  20%:  2  3  2  3  3  2  2
         fullness  10%:  1  1  1  1  1  1  1
         fullness   0%:  empty
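
      For illustration, a small userspace sketch that reproduces the bucketing
      shown above (the boundary arithmetic is inferred from the printout and is
      an assumption, not the zsmalloc code):

          #include <stdio.h>

          /* Nominal fullness bucket, as a percentage label. */
          static int fullness_bucket(int inuse, int objs_per_zspage)
          {
                  int pct;

                  if (inuse == 0)
                          return 0;
                  if (inuse == objs_per_zspage)
                          return 100;
                  pct = ((100 * inuse / objs_per_zspage) / 10 + 1) * 10;
                  return pct > 99 ? 99 : pct;   /* almost-full pages get 99% */
          }

          int main(void)
          {
                  int inuse[] = { 1, 2, 3, 5, 8, 9, 15, 16 };

                  /* class-768-like example: 16 objects per zspage */
                  for (unsigned int i = 0; i < sizeof(inuse) / sizeof(inuse[0]); i++)
                          printf("inuse=%2d -> %d%%\n", inuse[i],
                                 fullness_bucket(inuse[i], 16));
                  return 0;
          }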
      
      The zs_malloc() function searches through the groups of pages starting
      with the one having the highest usage ratio.  This means that it always
      selects a zspage from the group with the least internal fragmentation
      (highest usage ratio) and makes it even less fragmented by increasing its
      usage ratio.
      
      The zs_compact() function, on the other hand, begins by scanning the group
      with the highest fragmentation (lowest usage ratio) to locate the source
      page.  The first available zspage is selected, and then the function moves
      downward to find a destination zspage in the group with the lowest
      internal fragmentation (highest usage ratio).
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: remove insert_zspage() ->inuse optimization · a40a71e8
      Sergey Senozhatsky authored
      
      
      Patch series "zsmalloc: fine-grained fullness and new compaction
      algorithm", v4.
      
      Existing zsmalloc page fullness grouping leads to suboptimal page
      selection for both zs_malloc() and zs_compact().  This patchset reworks
      zsmalloc fullness grouping/classification.
      
      Additionally, it implements a new compaction algorithm that is expected
      to use fewer CPU cycles (as it potentially does fewer memcpy() calls in
      zs_object_copy()).
      
      Test (synthetic) results can be seen in patch 0003.
      
      
      This patch (of 4):
      
      This optimization has no effect.  It only ensures that when a zspage was
      added to its corresponding fullness list, its "inuse" counter was higher
      or lower than the "inuse" counter of the zspage at the head of the list. 
      The intention was to keep busy zspages at the head, so they could be
      filled up and moved to the ZS_FULL fullness group more quickly.  However,
      this doesn't work as the "inuse" counter of a zspage can be modified by
      obj_free() but the zspage may still belong to the same fullness list.  So,
      fix_fullness_group() won't change the zspage's position in relation to the
      head's "inuse" counter, leading to a largely random order of zspages
      within the fullness list.
      
      For instance, consider a printout of the "inuse" counters of the first 10
      zspages in a class that holds 93 objects per zspage:
      
       ZS_ALMOST_EMPTY:  36  67  68  64  35  54  63  52
      
      As we can see the zspage with the lowest "inuse" counter
      is actually the head of the fullness list.
      
      Remove this pointless "optimisation".
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org
      Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dma-buf: system_heap: avoid reclaim for order 4 · 3ccefdea
      Jaewon Kim authored
      
      
      Using order 4 pages would be helpful for IOMMU mapping, but trying to get
      order 4 pages can spend quite a lot of time in page allocation.  From the
      perspective of responsiveness, deterministic memory allocation speed, I
      think, is quite important.

      An order 4 allocation with __GFP_RECLAIM may spend much time in the
      reclaim and compaction logic.  __GFP_NORETRY may also have an effect.
      These cause unpredictable delays.

      To get a reasonable allocation speed from the dma-buf system heap, use
      HIGH_ORDER_GFP for order 4 to avoid reclaim, and remove the meaningless
      __GFP_COMP for order 0.

      According to my tests, order 4 with MID_ORDER_GFP can get a larger number
      of order 4 pages, but the elapsed times can be very slow.
      
               time	order 8	order 4	order 0
           584 usec	0	160	0
        28,428 usec	0	160	0
       100,701 usec	0	160	0
        76,645 usec	0	160	0
        25,522 usec	0	160	0
        38,798 usec	0	160	0
        89,012 usec	0	160	0
        23,015 usec	0	160	0
        73,360 usec	0	160	0
        76,953 usec	0	160	0
        31,492 usec	0	160	0
        75,889 usec	0	160	0
        84,551 usec	0	160	0
        84,352 usec	0	160	0
        57,103 usec	0	160	0
        93,452 usec	0	160	0
      
      If HIGH_ORDER_GFP is used for order 4, the number of order 4 pages may
      decrease, but the elapsed time results were quite stable and fast enough.
      
               time	order 8	order 4	order 0
         1,356 usec	0	155	80
         1,901 usec	0	11	2384
         1,912 usec	0	0	2560
         1,911 usec	0	0	2560
         1,884 usec	0	0	2560
         1,577 usec	0	0	2560
         1,366 usec	0	0	2560
         1,711 usec	0	0	2560
         1,635 usec	0	28	2112
           544 usec	10	0	0
           633 usec	2	128	0
           848 usec	0	160	0
           729 usec	0	160	0
         1,000 usec	0	160	0
         1,358 usec	0	160	0
         2,638 usec	0	31	2064
      
      Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Reviewed-by: John Stultz <jstultz@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: T.J. Mercier <tjmercier@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: add memsetXX tests · 78c74aee
      Alexander Potapenko authored
      
      
      Add tests ensuring that memset16()/memset32()/memset64() are instrumented
      by KMSAN and correctly initialize the memory.
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-4-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86: kmsan: use C versions of memset16/memset32/memset64 · 27f644dc
      Alexander Potapenko authored
      
      
      KMSAN must see as many memory accesses as possible to prevent false
      positive reports.  Fall back to versions of
      memset16()/memset32()/memset64() implemented in lib/string.c instead of
      those written in assembly.
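
      For reference, the lib/string.c versions are plain per-element store
      loops; a userspace approximation of memset16() (memset32()/memset64()
      are analogous, just with wider types) looks roughly like:

          #include <stddef.h>
          #include <stdint.h>
          #include <stdio.h>

          /* Simple stores that instrumentation such as KMSAN can observe. */
          void *memset16(uint16_t *s, uint16_t v, size_t count)
          {
                  uint16_t *xs = s;

                  while (count--)
                          *xs++ = v;
                  return s;
          }

          int main(void)
          {
                  uint16_t buf[4];

                  memset16(buf, 0xabcd, 4);
                  printf("%#x %#x\n", buf[0], buf[3]);
                  return 0;
          }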
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: another take at fixing memcpy tests · d3402925
      Alexander Potapenko authored
      Commit 5478afc5 ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR()
      to hide the uninitialized var from the compiler optimizations.
      
      However OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit,
      so memcpy tests did not actually check the behavior of memcpy(), because
      they always contained a KMSAN report.
      
      Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers
      the memory with a barrier(), and add a test case for memcpy() that does
      not expect an error report.
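
      A rough userspace sketch of the idea (the macro name here is made up; the
      kernel test defines its own file-local helper): an empty asm that only
      takes the variable's address as input, with a "memory" clobber, keeps the
      object alive without forcing a read of its (uninitialized) value the way
      OPTIMIZER_HIDE_VAR() does.

          /* Hypothetical stand-in for the file-local macro described above. */
          #define KEEP_BUT_DO_NOT_READ(var) \
                  __asm__ __volatile__("" : : "r"(&(var)) : "memory")

          int main(void)
          {
                  char uninit[16];               /* intentionally uninitialized */

                  KEEP_BUT_DO_NOT_READ(uninit);  /* no read of uninit's contents */
                  return 0;
          }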
      
      Also reflow kmsan_test.c with clang-format.
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86: kmsan: don't rename memintrinsics in uninstrumented files · 6dc4bd4e
      Alexander Potapenko authored
      
      
      clang -fsanitize=kernel-memory already replaces calls to
      memset/memcpy/memmove and their __builtin_ versions with
      __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so
      there is no need to override them.
      
      In non-instrumented versions we are now required to leave memset() and
      friends intact, so we cannot replace them with __msan_XXX() functions.
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Marco Elver <elver@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: cleanup memcg uncharge for failure path · 7cb1d7ef
      Peter Xu authored
      
      
      Explicit memcg uncharging is not needed when the memcg accounting has the
      same lifespan as the page/folio.  That becomes the case for khugepaged
      after Yang & Zach's recent rework, so the hpage will be allocated for each
      collapse rather than being cached.

      Clean up the explicit memcg uncharge in the khugepaged failure path and
      leave that to put_page().
      
      Link: https://lkml.kernel.org/r/20230303151218.311015-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: David Stevens <stevensd@chromium.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/debug_vm_pgtable: replace pte_mkhuge() with arch_make_huge_pte() · 9dabf6e1
      Anshuman Khandual authored
      Since commit 16785bd7 ("mm: merge pte_mkhuge() call into
      arch_make_huge_pte()"), arch_make_huge_pte() should be used directly in
      the generic memory subsystem as a platform-provided page table helper,
      instead of pte_mkhuge().  Change hugetlb_basic_tests() to call
      arch_make_huge_pte() directly, and update its relevant documentation entry
      as required.
      
      Link: https://lkml.kernel.org/r/20230302114845.421674-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
        Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate: drop pte_mkhuge() in remove_migration_pte() · 1da28f1b
      Anshuman Khandual authored
      Since commit 16785bd7 ("mm: merge pte_mkhuge() call into
      arch_make_huge_pte()"), arch_make_huge_pte() should be used directly in
      the generic memory subsystem as a platform-provided page table helper,
      instead of pte_mkhuge().  This just drops pte_mkhuge() from
      remove_migration_pte(), which has now become redundant.
      
      Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
        Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: remove unneeded cgroup_throttle_swaprate() · 3e4fb13a
      Kefeng Wang authored
      
      
      All the callers of cgroup_throttle_swaprate() are converted to
      folio_throttle_swaprate(), so make __cgroup_throttle_swaprate() take a
      folio and rename it to __folio_throttle_swaprate(); also rename gfp_mask
      to gfp and drop the redundant extern keyword.  Finally, drop the unused
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-8-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio_throttle_swaprate() in do_cow_fault() · 68fa572b
      Kefeng Wang authored
      
      
      Directly use folio_throttle_swaprate() instead of
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio_throttle_swaprate() in do_anonymous_page() · e2bf3e2c
      Kefeng Wang authored
      
      
      Directly use folio_throttle_swaprate() instead of
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio_throttle_swaprate() in wp_page_copy() · 4d4f75bf
      Kefeng Wang authored
      
      
      Directly use folio_throttle_swaprate() instead of
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio_throttle_swaprate() in page_copy_prealloc() · e601ded4
      Kefeng Wang authored
      
      
      Directly use folio_throttle_swaprate() instead of
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use folio_throttle_swaprate() in do_swap_page() · 4231f842
      Kefeng Wang authored
      
      
      Directly use folio_throttle_swaprate() instead of
      cgroup_throttle_swaprate().
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: convert __do_huge_pmd_anonymous_page() to use a folio · cfe3236d
      Kefeng Wang authored
      
      
      Patch series "mm: remove cgroup_throttle_swaprate() completely", v2.
      
      Convert all the caller functions of cgroup_throttle_swaprate() to use
      folios, and use folio_throttle_swaprate(), which allows us to remove
      cgroup_throttle_swaprate() completely.
      
      
      This patch (of 7):
      
      Convert from page to folio within __do_huge_pmd_anonymous_page().  As we
      need the precise page which is to be stored at this PTE in the folio, the
      function still keeps a page as the parameter.
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20230302115835.105364-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: call clear_page with a match-all tag instead of changing page tag · 16d91faf
      Peter Collingbourne authored
      
      
      Instead of changing the page's tag solely in order to obtain a pointer
      with a match-all tag and then changing it back again, just convert the
      pointer that we get from kmap_atomic() into one with a match-all tag
      before passing it to clear_page().
      
      On a certain microarchitecture, this has been observed to cause a
      measurable improvement in microbenchmark performance, presumably as a
      result of being able to avoid the atomic operations on the page tag.
      
      Link: https://lkml.kernel.org/r/20230216195924.3287772-1-pcc@google.com
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Link: https://linux-review.googlesource.com/id/I0249822cc29097ca7a04ad48e8eb14871f80e711
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: cgroup: add 'malloc' failures checks in test_memcontrol · af7df1c9
      Ivan Orlov authored
      
      
      There are several 'malloc' calls in test_memcontrol which can be
      unsuccessful.  This patch adds checks for 'malloc' failures to give more
      details about the test's failure reasons and to avoid possible undefined
      behavior from a later NULL dereference (like the one in the
      alloc_anon_50M_check_swap function).
      
      Link: https://lkml.kernel.org/r/20230226131634.34366-1-ivan.orlov0322@gmail.com
      Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: use atomic_try_cmpxchg in set_tlb_ubc_flush_pending · bdeb9188
      Uros Bizjak authored
      
      
      Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
      in set_tlb_ubc_flush_pending().  The x86 CMPXCHG instruction returns
      success in the ZF flag, so this change saves a compare after cmpxchg (and
      the related move instruction in front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails.
      
      No functional change intended.
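
      A userspace illustration of the pattern (using the GCC/Clang atomic
      builtins rather than the kernel's atomic_t API; the function names are
      invented for this sketch):

          #include <stdbool.h>
          #include <stdio.h>

          static int flush_pending;   /* stand-in for the atomic flag */

          static bool claim_cmpxchg(void)
          {
                  /* cmpxchg-and-compare: the result must be compared with "old" */
                  int old = 0;
                  return __sync_val_compare_and_swap(&flush_pending, old, 1) == old;
          }

          static bool claim_try_cmpxchg(void)
          {
                  /* try_cmpxchg style: success is returned as a boolean (x86: ZF),
                   * and "old" is updated with the current value on failure */
                  int old = 0;
                  return __atomic_compare_exchange_n(&flush_pending, &old, 1, false,
                                                     __ATOMIC_SEQ_CST,
                                                     __ATOMIC_SEQ_CST);
          }

          int main(void)
          {
                  printf("first claim:  %d\n", claim_cmpxchg());      /* 1 */
                  printf("second claim: %d\n", claim_try_cmpxchg());  /* 0 */
                  return 0;
          }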
      
      Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/debug: use %pGt to display page_type in dump_page() · f2421a16
      Hyeonggon Yoo authored
      
      
      Some page flags are stored in page_type rather than in the ->flags field.
      Use the newly introduced %pGt page_type format in dump_page().
      
      Below are some examples:
      
      page:00000000da7184dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x101cb3
      flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff)
      page_type: 0xffffffff()
      raw: 02ffff0000000000 0000000000000000 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
      page dumped because: newly allocated page
      
      page:00000000da7184dd refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x101cb3
      flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff)
      page_type: 0xffffff7f(buddy)
      raw: 02ffff0000000000 ffff88813fff8e80 ffff88813fff8e80 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
      page dumped because: freed page
      
      page:0000000042202316 refcount:3 mapcount:2 mapping:0000000000000000 index:0x7f634722a pfn:0x11994e
      memcg:ffff888100135000
      anon flags: 0x2ffff0000080024(uptodate|active|swapbacked|node=0|zone=2|lastcpupid=0xffff)
      page_type: 0x1()
      raw: 02ffff0000080024 0000000000000000 dead000000000122 ffff8881193398f1
      raw: 00000007f634722a 0000000000000000 0000000300000001 ffff888100135000
      page dumped because: user-mapped page
      
      Link: https://lkml.kernel.org/r/20230130042514.2418-4-42.hyeyoo@gmail.com
      Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, printk: introduce new format %pGt for page_type · 4c85c0be
      Hyeonggon Yoo authored
      
      
      The %pGp format is used to display the 'flags' field of a struct page.
      However, some page flags (e.g.  PG_buddy, see page-flags.h for more
      details) are stored in the page_type field.  To display human-readable
      output of page_type, introduce the %pGt format.

      It is important to note that the meaning of bits is different in
      page_type: if page_type is 0xffffffff, no flags are set.  Setting the
      PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f.  Clearing
      a bit actually means setting a flag.  Bits in page_type are inverted when
      displaying type names.

      Only values for which page_type_has_type() returns true are considered as
      page_type, to avoid confusion with mapcount values.  If it returns false,
      only raw values are displayed and not page type names.
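
      A small userspace demonstration of the inverted encoding (PG_buddy's bit
      value is taken from the text above; this is not the kernel's page-flags
      code):

          #include <stdio.h>

          #define PG_BUDDY_BIT 0x00000080u

          int main(void)
          {
                  unsigned int page_type = 0xffffffffu;  /* all ones: no type set */

                  page_type &= ~PG_BUDDY_BIT;  /* "setting" PG_buddy clears the bit */
                  printf("page_type: 0x%08x%s\n", page_type,
                         !(page_type & PG_BUDDY_BIT) ? " (buddy)" : "");
                  return 0;
          }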
      
      Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
      Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>	[vsprintf part]
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mmflags.h: use less error prone method to define pageflag_names · e26fcc02
      Hyeonggon Yoo authored
      
      
      Patch series "mm, printk: introduce new format for page_type", v4.
      
      This series moves PG_slab page flag to page_type, freeing one bit in
      page->flags and introduces %pGt format that prints human-readable
      page_type like %pGp for printing page flags.
      
      See changelog of patch 2 for more implementation details.
      
      Thanks everyone that gave valuable comments.
      
      
      This patch (of 3):
      
      Use helper macro to decrease chances of typo when defining pageflag_names.
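
      For illustration, a userspace sketch of the helper-macro approach (the
      enum values and the macro name are stand-ins, not necessarily the ones
      used in mmflags.h):

          #include <stdio.h>

          enum pageflags { PG_locked, PG_dirty, PG_lru };   /* toy subset */

          #define __stringify(x) #x
          /* One macro derives both the mask and the printable name, so the
           * two cannot drift apart through a typo. */
          #define DEF_PAGEFLAG_NAME(_name) \
                  { 1UL << PG_##_name, __stringify(_name) }

          static const struct { unsigned long mask; const char *name; }
          pageflag_names[] = {
                  DEF_PAGEFLAG_NAME(locked),
                  DEF_PAGEFLAG_NAME(dirty),
                  DEF_PAGEFLAG_NAME(lru),
          };

          int main(void)
          {
                  unsigned int n = sizeof(pageflag_names) / sizeof(pageflag_names[0]);

                  for (unsigned int i = 0; i < n; i++)
                          printf("%#lx %s\n", pageflag_names[i].mask,
                                 pageflag_names[i].name);
                  return 0;
          }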
      
      Link: https://lkml.kernel.org/r/20230130042514.2418-1-42.hyeyoo@gmail.com
      Link: https://lore.kernel.org/lkml/Y6AycLbpjVzXM5I9@smile.fi.intel.com
      Link: https://lkml.kernel.org/r/20230130042514.2418-2-42.hyeyoo@gmail.com
      Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add tracepoints to ksm · 739100c8
      Stefan Roesch authored
      
      
      This adds the following tracepoints to ksm:
      - start / stop scan
      - ksm enter / exit
      - merge a page
      - merge a page with ksm
      - remove a page
      - remove a rmap item
      
      This patch has been split off from the RFC patch series "mm:
      process/cgroup ksm support".
      
      Link: https://lkml.kernel.org/r/20230210214645.2720847-1-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN · 77f68ebe
      Nicholas Piggin authored
      
      
      On a 16-socket 192-core POWER8 system, the context_switch1_threads
      benchmark from will-it-scale (see earlier changelog), upstream can achieve
      a rate of about 1 million context switches per second, due to contention
      on the mm refcount.
      
      64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
      the option.  This increases the above benchmark to 118 million context
      switches per second.
      
      This generates 314 additional IPI interrupts on a 144 CPU system doing a
      kernel compile, which is in the noise in terms of kernel cycles.
      
      Link: https://lkml.kernel.org/r/20230203071837.1136453-6-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme · 2655421a
      Nicholas Piggin authored
      
      
      On big systems, the mm refcount can become highly contended when doing a
      lot of context switching with threaded applications.  user<->idle switch
      is one of the important cases.  Abandoning lazy tlb entirely slows this
      switching down quite a bit in the common uncontended case, so that is not
      viable.
      
      Implement a scheme where lazy tlb mm references do not contribute to the
      refcount, instead they get explicitly removed when the refcount reaches
      zero.
      
      The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
      switch away from this mm to init_mm if it was being used as the lazy tlb
      mm.  Enabling the shoot lazies option therefore requires that the arch
      ensures that mm_cpumask contains all CPUs that could possibly be using mm.
      A DEBUG_VM option IPIs every CPU in the system after this to ensure there
      are no references remaining before the mm is freed.
      
      Shootdown IPIs cost could be an issue, but they have not been observed to
      be a serious problem with this scheme, because short-lived processes tend
      not to migrate CPUs much, therefore they don't get much chance to leave
      lazy tlb mm references on remote CPUs.  There are a lot of options to
      reduce them if necessary, described in comments.
      
      The near-worst-case can be benchmarked with will-it-scale:
      
        context_switch1_threads -t $(($(nproc) / 2))
      
      This will create nproc threads (nproc / 2 switching pairs) all sharing the
      same mm that spread over all CPUs so each CPU does thread->idle->thread
      switching.
      
      [ Rik came up with basically the same idea a few years ago, so credit
        to him for that. ]
      
      Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
      Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
      Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lazy tlb: allow lazy tlb mm refcounting to be configurable · 88e3009b
      Nicholas Piggin authored
      
      
      Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
      when it is context switched.  This can be disabled by architectures that
      don't require this refcounting if they clean up lazy tlb mms when the last
      refcount is dropped.  Currently this is always enabled, so the patch
      introduces no functional change.
      
      Link: https://lkml.kernel.org/r/20230203071837.1136453-4-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lazy tlb: introduce lazy tlb mm refcount helper functions · aa464ba9
      Nicholas Piggin authored
      
      
      Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
      This makes the lazy tlb mm references more obvious, and allows the
      refcounting scheme to be modified in later changes.  There is no
      functional change with this patch.
      
      Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kthread: simplify kthread_use_mm refcounting · 6cad87b0
      Nicholas Piggin authored
      
      
      Patch series "shoot lazy tlbs (lazy tlb refcount scalability
      improvement)", v7.
      
      This series improves scalability of context switching between user and
      kernel threads on large systems with a threaded process spread across a
      lot of CPUs.
      
      Discussion of v6 here:
      https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
      
      
      This patch (of 5):
      
      Remove the special case avoiding refcounting when the mm to be used is the
      same as the kernel thread's active (lazy tlb) mm.  kthread_use_mm() should
      not be such a performance critical path that this matters much.  This
      simplifies a later change to lazy tlb mm refcounting.
      
      Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com
      Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zswap: try to avoid worst-case scenario on same element pages · 62bf1258
      Taejoon Song authored
      The worst-case scenario for finding same-element pages is that almost all
      elements look the same at first glance but only the last few elements are
      different.

      Since the same element tends to be grouped from the beginning of the
      pages, if we check the first element against the last element before
      looping through all elements, we have a good chance of quickly detecting
      non-same-element pages.
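
      For illustration, a simplified userspace sketch of the check (the helper
      name mirrors zswap's, but this is a stand-in, not the kernel code):

          #include <stdbool.h>
          #include <stddef.h>
          #include <stdio.h>

          #define PAGE_WORDS (4096 / sizeof(unsigned long))

          static bool page_same_filled(const unsigned long *page,
                                       unsigned long *element)
          {
                  size_t last = PAGE_WORDS - 1;

                  /* Early exit: if the first and last words differ, the page is
                   * definitely not same-filled - detected in O(1). */
                  if (page[0] != page[last])
                          return false;

                  for (size_t i = 1; i < last; i++)
                          if (page[i] != page[0])
                                  return false;

                  *element = page[0];
                  return true;
          }

          int main(void)
          {
                  unsigned long buf[PAGE_WORDS] = { 0 };  /* zero-filled "page"    */
                  unsigned long val;

                  buf[PAGE_WORDS - 1] = 1;              /* ...except the last word */
                  printf("same-filled: %d\n", page_same_filled(buf, &val)); /* 0 */
                  return 0;
          }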
      
      1. Test is done under LG webOS TV (64-bit arch)
      2. Dump the swap-out pages (~819200 pages)
      3. Analyze the pages with a simple test script which counts the iteration
         number and measures the speed offline
      
      Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes = 512.
      The speed is based on the time consumed by the page_same_filled()
      function only.  The result, on average, is listed below:
      
                                         Num of Iter    Speed(MB/s)
      Looping-Forward (Orig)                 38            99265
      Looping-Backward                       36           102725
      Last-element-check (This Patch)        33           125072
      
      The result shows that the average iteration count decreases by 13% and the
      speed increases by 25% with this patch.  This patch does not increase the
      overall time complexity, though.
      
      I also ran a simpler version which uses a backward loop.  Just looping
      backward also gives some improvement, but less than this patch.
      
      A similar change has already been made to zram in commit 90f82cbf ("zram:
      try to avoid worst-case scenario on same element pages").
      
      Link: https://lkml.kernel.org/r/20230205190036.1730134-1-taejoon.song@lge.com
      Signed-off-by: Taejoon Song <taejoon.song@lge.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Taejoon Song <taejoon.song@lge.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <yjay.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: improve design doc · 32d32ef1
      T.J. Alumbaugh authored
      
      
      This patch improves the design doc. Specifically,
        1. add a section for the per-memcg mm_struct list, and
        2. add a section for the PID controller.
      
      Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com
      Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: clean up sysfs code · 9a52b2f3
      T.J. Alumbaugh authored
      
      
      This patch cleans up the sysfs code. Specifically,
        1. use sysfs_emit(),
        2. use __ATTR_RW(), and
        3. constify multi-gen LRU struct attribute_group.
      
      Link: https://lkml.kernel.org/r/20230214035445.1250139-1-talumbau@google.com
      Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86/mm/pat: clear VM_PAT if copy_p4d_range failed · d155df53
      Ma Wupeng authored
      
      
      Syzbot reports a warning in untrack_pfn().  Digging into the root cause,
      we found that this is due to a memory allocation failure in
      pmd_alloc_one(), and this failure is produced by failslab.

      In copy_page_range(), the memory allocation for the pmd failed.  During
      the error handling process in copy_page_range(), mmput() is called to
      remove all vmas.  When untrack_pfn() is called on this empty pfn, the
      warning happens.
      
      Here's a simplified flow:
      
      dup_mm
        dup_mmap
          copy_page_range
            copy_p4d_range
              copy_pud_range
                copy_pmd_range
                  pmd_alloc
                    __pmd_alloc
                      pmd_alloc_one
                        page = alloc_pages(gfp, 0);
                          if (!page)
                            return NULL;
          mmput
              exit_mmap
                unmap_vmas
                  unmap_single_vma
                    untrack_pfn
                      follow_phys
                        WARN_ON_ONCE(1);
      
      Since this vma was not generated successfully, we can clear the VM_PAT
      flag.  In this case, untrack_pfn() will not be called while cleaning up
      this vma.

      The function untrack_pfn_moved() has also been renamed to fit the new logic.
      
      Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/userfaultfd: support WP on multiple VMAs · a1b92a3f
      Muhammad Usama Anjum authored
      
      
      mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA. 
      We are facing a use case where multiple VMAs are present in one range of
      interest.  For example, the following pseudocode reproduces the error
      which we are trying to fix:
      
      - Allocate memory of size 16 pages with PROT_NONE with mmap
      - Register userfaultfd
      - Change protection of the first half (1 to 8 pages) of memory to
        PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
      - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
        out.
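
      A rough C sketch of that sequence (error handling trimmed; the exact
      feature setup is simplified, but the ioctls and
      UFFDIO_WRITEPROTECT_MODE_WP are the standard userfaultfd API):

          #include <fcntl.h>
          #include <linux/userfaultfd.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          int main(void)
          {
                  size_t page = sysconf(_SC_PAGESIZE), len = 16 * page;
                  char *mem = mmap(NULL, len, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                  int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
                  struct uffdio_api api = { .api = UFFD_API };
                  ioctl(uffd, UFFDIO_API, &api);

                  struct uffdio_register reg = {
                          .range = { .start = (unsigned long)mem, .len = len },
                          .mode  = UFFDIO_REGISTER_MODE_WP,
                  };
                  ioctl(uffd, UFFDIO_REGISTER, &reg);

                  /* Split the area into two VMAs by changing the protection of
                   * the first half. */
                  mprotect(mem, len / 2, PROT_READ | PROT_WRITE);

                  /* Write-protect the whole 16-page range: before this patch,
                   * the ioctl errors out because [start, end) spans two VMAs. */
                  struct uffdio_writeprotect wp = {
                          .range = { .start = (unsigned long)mem, .len = len },
                          .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
                  };
                  if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                          perror("UFFDIO_WRITEPROTECT");
                  return 0;
          }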
      
      This is a simple use case where user may or may not know if the memory
      area has been divided into multiple VMAs.
      
      We need an implementation which doesn't disrupt the already present users.
      So keeping things simple, stop going over all the VMAs if any one of the
      VMAs hasn't been registered in WP mode.  While at it, remove the unneeded
      error check as well.
      
      [akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build]
      Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reported-by: Paul Gofman <pgofman@codeweavers.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, page_alloc: reduce page alloc/free sanity checks · 700d2e9a
      Vlastimil Babka authored
      Historically, we have performed sanity checks on all struct pages being
      allocated or freed, making sure they have no unexpected page flags or
      certain field values.  This can detect insufficient cleanup and some cases
      of use-after-free, although on its own it can't always identify the
      culprit.  The result is a warning and the "bad page" being leaked.
      
      The checks do need some cpu cycles, so in 4.7 with commits 479f854a
      ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      and 4db7548c ("mm, page_alloc: defer debugging checks of freed pages
      until a PCP drain") they were no longer performed in the hot paths when
      allocating and freeing from pcplists, but only when pcplists are bypassed,
      refilled or drained.  For debugging purposes, with CONFIG_DEBUG_VM enabled
      the checks were instead still done in the hot paths and not when refilling
      or draining pcplists.
      
      With 4462b32c ("mm, page_alloc: more extensive free page checking with
      debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks
      back to hot paths.  When both debug_pagealloc and CONFIG_DEBUG_VM are
      enabled, the checks are done both in hot paths and in pcplist
      refill/drain.
      
      Even though the non-debug default today might seem to be a sensible
      tradeoff between overhead and ability to detect bad pages, on closer look
      it's arguably not.  As most allocations go through the pcplists, catching
      any bad pages when refilling or draining pcplists has only a small chance,
      insufficient for debugging or serious hardening purposes.  On the other
      hand the cost of the checks is concentrated in the already expensive
      drain/refill batching operations, and those are done under the often
      contended zone lock.  That was recently identified as an issue for page
      allocation and the zone lock contention reduced by moving the checks
      outside of the locked section with a patch "mm: reduce lock contention of
      pcp buffer refill", but the cost of the checks is still visible compared
      to their removal [1].  In the pcplist draining path free_pcppages_bulk()
      the checks are still done under zone->lock.
      
      Thus, remove the checks from the pcplist refill and drain paths completely.
      Introduce a static key check_pages_enabled to control checks during page
      allocation and freeing (whether the pcplist is used or bypassed).  The
      static key is enabled if any of the following is true:
      
      - kernel is built with CONFIG_DEBUG_VM=y (debugging)
      - debug_pagealloc or page poisoning is boot-time enabled (debugging)
      - init_on_alloc or init_on_free is boot-time enabled (hardening)
      
      The resulting user visible changes:
      - no checks when draining/refilling pcplists - less overhead, with
        likely no practical reduction of ability to catch bad pages
      - no checks when bypassing pcplists in default config (no
        debugging/hardening) - less overhead etc. as above
      - on typical hardened kernels [2], checks are now performed on each page
        allocation/free (previously only when bypassing/draining/refilling
        pcplists) - the init_on_alloc/init_on_free enabled should be sufficient
        indication for preferring more costly alloc/free operations for
        hardening purposes and we shouldn't need to introduce another toggle
      - code (various wrappers) removal and simplification
      
      [1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
      [2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/
      
      [akpm@linux-foundation.org: coding-style cleanups]
      [akpm@linux-foundation.org: make check_pages_enabled static]
      Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz
      Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: reduce lock contention of pcp buffer refill · 2ede3c13
      Alexander Halbuer authored
      
      
      rmqueue_bulk() batches the allocation of multiple elements to refill the
      per-CPU buffers into a single hold of the zone lock.  Each element is
      allocated and checked using check_pcp_refill().  The check touches every
      related struct page which is especially expensive for higher order
      allocations (huge pages).
      
      This patch reduces the time holding the lock by moving the check out of
      the critical section similar to rmqueue_buddy() which allocates a single
      element.
      
      Measurements of parallel allocation-heavy workloads show a reduction of
      the average huge page allocation latency of 50 percent for two cores and
      nearly 90 percent for 24 cores.
      
      Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.de
      Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>