  1. Jan 19, 2023
    • mm: multi-gen LRU: per-node lru_gen_folio lists · e4dde56c
      Yu Zhao authored
      
      
      For each node, memcgs are divided into two generations: the old and
      the young. For each generation, memcgs are randomly sharded into
      multiple bins to improve scalability. For each bin, an RCU hlist_nulls
      is virtually divided into three segments: the head, the tail and the
      default.
      
      An onlining memcg is added to the tail of a random bin in the old
      generation. The eviction starts at the head of a random bin in the old
      generation. The per-node memcg generation counter, whose remainder
      (mod 2) indexes the old generation, is incremented when all its bins
      become empty.
      
      There are four operations:
      1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
         its current generation (old or young) and updates its "seg" to
         "head";
      2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
         its current generation (old or young) and updates its "seg" to
         "tail";
      3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
         the old generation, updates its "gen" to "old" and resets its "seg"
         to "default";
      4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
         in the young generation, updates its "gen" to "young" and resets
         its "seg" to "default".
      
      The events that trigger the above operations are:
      1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
      2. The first attempt to reclaim an memcg below low, which triggers
         MEMCG_LRU_TAIL;
      3. The first attempt to reclaim an memcg below reclaimable size
         threshold, which triggers MEMCG_LRU_TAIL;
      4. The second attempt to reclaim an memcg below reclaimable size
         threshold, which triggers MEMCG_LRU_YOUNG;
      5. Attempting to reclaim an memcg below min, which triggers
         MEMCG_LRU_YOUNG;
      6. Finishing the aging on the eviction path, which triggers
         MEMCG_LRU_YOUNG;
      7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
      
      Note that memcg LRU only applies to global reclaim, and the
      round-robin incrementing of their max_seq counters ensures the
      eventual fairness to all eligible memcgs. For memcg reclaim, it still
      relies on mem_cgroup_iter().
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: shuffle should_run_aging() · 77d4459a
      Yu Zhao authored
      
      
      Move should_run_aging() next to its only caller left.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: remove aging fairness safeguard · 7348cc91
      Yu Zhao authored
      
      
      Recall that the aging produces the youngest generation: first it scans
      for accessed folios and updates their gen counters; then it increments
      lrugen->max_seq.
      
      The current aging fairness safeguard for kswapd uses two passes to
      ensure the fairness to multiple eligible memcgs. On the first pass,
      which is shared with the eviction, it checks whether all eligible
      memcgs are low on cold folios. If so, it requires a second pass, on
      which it ages all those memcgs at the same time.
      
      With memcg LRU, the aging, while ensuring eventual fairness, will run
      when necessary. Therefore the current aging fairness safeguard for
      kswapd will not be needed.
      
      Note that memcg LRU only applies to global reclaim. For memcg reclaim,
      the aging can be unfair to different memcgs, i.e., their
      lrugen->max_seq can be incremented at different paces.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: remove eviction fairness safeguard · a579086c
      Yu Zhao authored
      
      
      Recall that the eviction consumes the oldest generation: first it
      bucket-sorts folios whose gen counters were updated by the aging and
      reclaims the rest; then it increments lrugen->min_seq.
      
      The current eviction fairness safeguard for global reclaim has a
      dilemma: when there are multiple eligible memcgs, should it continue
      or stop upon meeting the reclaim goal? If it continues, it overshoots
      and increases direct reclaim latency; if it stops, it loses fairness
      between memcgs it has taken memory away from and those it has yet to.
      
      With memcg LRU, the eviction, while ensuring eventual fairness, will
      stop upon meeting its goal. Therefore the current eviction fairness
      safeguard for global reclaim will not be needed.
      
      Note that memcg LRU only applies to global reclaim. For memcg reclaim,
      the eviction will continue, even if it is overshooting. This becomes
      unconditional due to code simplification.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[] · 6df1b221
      Yu Zhao authored
      
      
      lru_gen_folio will be chained into per-node lists by the coming
      lrugen->list.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio · 391655fe
      Yu Zhao authored
      
      
      Patch series "mm: multi-gen LRU: memcg LRU", v3.
      
      Overview
      ========
      
      An memcg LRU is a per-node LRU of memcgs.  It is also an LRU of LRUs,
      since each node and memcg combination has an LRU of folios (see
      mem_cgroup_lruvec()).
      
      Its goal is to improve the scalability of global reclaim, which is
      critical to system-wide memory overcommit in data centers.  Note that
      memcg reclaim is currently out of scope.
      
      Its memory overhead is one pointer added to each lruvec and a
      negligible amount added to each pglist_data.  In terms of traversing
      memcgs during global reclaim, it improves the best-case complexity
      from O(n) to O(1) and does not affect the worst-case complexity of
      O(n).  Therefore, on average, it has sublinear complexity in contrast
      to the current linear complexity.
      
      The basic structure of an memcg LRU can be understood by an analogy to
      the active/inactive LRU (of folios):
      1. It has the young and the old (generations), i.e., the counterparts
         to the active and the inactive;
      2. The increment of max_seq triggers promotion, i.e., the counterpart
         to activation;
      3. Other events trigger similar operations, e.g., offlining an memcg
         triggers demotion, i.e., the counterpart to deactivation.
      
      In terms of global reclaim, it has two distinct features:
      1. Sharding, which allows each thread to start at a random memcg (in
         the old generation) and improves parallelism;
      2. Eventual fairness, which allows direct reclaim to bail out at will
         and reduces latency without affecting fairness over some time.
      
      The commit message in patch 6 details the workflow:
      https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
      
      The following is a simple test to quickly verify its effectiveness.
      
        Test design:
        1. Create multiple memcgs.
        2. Each memcg contains a job (fio).
        3. All jobs access the same amount of memory randomly.
        4. The system does not experience global memory pressure.
        5. Periodically write to the root memory.reclaim.
      
        Desired outcome:
        1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
           over mean(pgsteal) is close to 0%.
        2. The total pgsteal is close to the total requested through
           memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
           to 100%.
      
        Actual outcome [1]:
                                           MGLRU off    MGLRU on
        stddev(pgsteal) / mean(pgsteal)    75%          20%
        sum(pgsteal) / sum(requested)      425%         95%
      
        ####################################################################
        MEMCGS=128
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            mkdir /sys/fs/cgroup/memcg$memcg
        done
      
        start() {
            echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs
      
            fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
                --filename=/dev/zero --size=1920M --rw=randrw \
                --rate=64m,64m --random_distribution=random \
                --fadvise_hint=0 --time_based --runtime=10h \
                --group_reporting --minimal
        }
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            start &
        done
      
        sleep 600
      
        for ((i = 0; i < 600; i++)); do
            echo 256m >/sys/fs/cgroup/memory.reclaim
            sleep 6
        done
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
        done
        ####################################################################
      
      [1]: This was obtained from running the above script (touches less
           than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
           hour.
      
      
      This patch (of 8):
      
      The new name lru_gen_folio will be more distinct from the coming
      lru_gen_memcg.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE() · 14687619
      Uladzislau Rezki (Sony) authored
      
      
      Currently the vm_unmap_ram() function triggers a BUG() if an area is
      not found.  Replace it with a WARN_ON_ONCE() error message and keep
      the machine alive instead of stopping it.

      The worst case is a memory leak.
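
      The shape of the change is roughly the following (simplified sketch,
      not the exact diff):

        struct vmap_area *va;

        va = find_unlink_vmap_area(addr);
        if (WARN_ON_ONCE(!va))
                return;         /* warn once and keep going; worst case we leak */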
      
      Link: https://lkml.kernel.org/r/20221222190022.134380-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap() · edd89818
      Uladzislau Rezki (Sony) authored
      
      
      Currently the __vunmap() path calls __find_vmap_area() twice.  Once on
      entry to check that the area exists, then inside the remove_vm_area()
      function which also performs a new search for the VA.
      
      In order to improve it from a performance point of view, we split
      remove_vm_area() into two new parts:
        - find_unlink_vmap_area() that does a search and unlinks the VA
          from the tree;
        - __remove_vm_area() that removes it without searching.
      
      There is no functional change for remove_vm_area(), whereas
      vm_remove_mappings(), where the second search used to happen,
      switches to the __remove_vm_area() variant and takes the already
      detached VA as a parameter, so there is no need to find it again.
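
      A rough sketch of the new split, based on the description above
      (simplified, not the exact code):

        /* one pass under vmap_area_lock: find the VA and unlink it */
        va = find_unlink_vmap_area((unsigned long)addr);
        if (!va)
                return;

        /* later, remove the already detached VA without searching again */
        __remove_vm_area(va);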
      
      Performance-wise, I used test_vmalloc.sh with 32 threads doing
      alloc/free on a 64-CPU x86_64 box:
      
      perf without this patch:
      -   31.41%     0.50%  vmalloc_test/10  [kernel.vmlinux]    [k] __vunmap
         - 30.92% __vunmap
            - 17.67% _raw_spin_lock
                 native_queued_spin_lock_slowpath
            - 12.33% remove_vm_area
               - 11.79% free_vmap_area_noflush
                  - 11.18% _raw_spin_lock
                       native_queued_spin_lock_slowpath
              0.76% free_unref_page
      
      perf with this patch:
      -   11.35%     0.13%  vmalloc_test/14  [kernel.vmlinux]    [k] __vunmap
         - 11.23% __vunmap
            - 8.28% find_unlink_vmap_area
               - 7.95% _raw_spin_lock
                    7.44% native_queued_spin_lock_slowpath
            - 1.93% free_vmap_area_noflush
               - 0.56% _raw_spin_lock
                    0.53% native_queued_spin_lock_slowpath
              0.60% __vunmap_range_noflush
      
      __vunmap() consumes around 20% fewer CPU cycles in this test.
      
      Also, switch from find_vmap_area() to find_unlink_vmap_area() to
      prevent a double access to the vmap_area_lock: one access for finding
      the area and a second one for unlinking it from the tree.
      
      [urezki@gmail.com: switch to find_unlink_vmap_area() in vm_unmap_ram()]
        Link: https://lkml.kernel.org/r/20221222190022.134380-2-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20221222190022.134380-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reported-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move FOLL_* defs to mm_types.h · b5054174
      David Howells authored
      
      
      Move the FOLL_* definitions to linux/mm_types.h to make them more
      accessible without having to drag in all of linux/mm.h and everything
      that it drags in too[1].
      
      Link: https://lkml.kernel.org/r/2161258.1671657894@warthog.procyon.org.uk
      Signed-off-by: David Howells <dhowells@redhat.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: new primitive kvmemdup() · 0b7b8704
      Hao Sun authored
      
      
      Similar to kmemdup(), but supports large amounts of bytes with
      kvmalloc() and does *not* guarantee that the result will be
      physically contiguous.  Use only in cases where kvmalloc() is needed
      and free it with kvfree().  Also adapt policy_unpack.c in case
      someone bisects into this.
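
      A minimal caller-side usage sketch ('src' and 'len' are placeholders):

        void *copy;

        copy = kvmemdup(src, len, GFP_KERNEL);
        if (!copy)
                return -ENOMEM;

        /* ... use 'copy'; it may not be physically contiguous ... */

        kvfree(copy);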
      
      Link: https://lkml.kernel.org/r/20221221144245.27164-1-sunhao.th@gmail.com
      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: convert deactivate_page() to folio_deactivate() · 5a9e3474
      Vishal Moola (Oracle) authored
      
      
      deactivate_page() has already been converted to use folios.  This
      change converts it to take a folio argument instead of calling
      page_folio(), and renames the function to folio_deactivate() to be
      more consistent with other folio functions.
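
      The call-site conversion looks roughly like this (illustrative
      before/after, not a specific caller):

        /* before */
        deactivate_page(&folio->page);

        /* after */
        folio_deactivate(folio);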
      
      [akpm@linux-foundation.org: fix left-over comments, per Yu Zhao]
      Link: https://lkml.kernel.org/r/20221221180848.20774-5-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios · f70da5ee
      Vishal Moola (Oracle) authored
      
      
      This change replaces 2 calls to compound_head() from put_page() and 1 call
      from mark_page_accessed() with one from page_folio().  This is in
      preparation for the conversion of deactivate_page() to folio_deactivate().
      
      Link: https://lkml.kernel.org/r/20221221180848.20774-4-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise: convert madvise_cold_or_pageout_pte_range() to use folios · 07e8c82b
      Vishal Moola (Oracle) authored
      
      
      This change removes a number of calls to compound_head(), and saves
      1729 bytes of kernel text.
      
      Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: add vm_normal_folio() · 318e9342
      Vishal Moola (Oracle) authored
      
      
      Patch series "Convert deactivate_page() to folio_deactivate()", v4.
      
      Deactivate_page() has already been converted to use folios.  This patch
      series modifies the callers of deactivate_page() to use folios.  It also
      introduces vm_normal_folio() to assist with folio conversions, and
      converts deactivate_page() to folio_deactivate() which takes in a folio.
      
      
      This patch (of 4):
      
      Introduce a wrapper function called vm_normal_folio().  This function
      calls vm_normal_page() and returns the folio of the page found, or
      NULL if no page is found.
      
      This function allows callers to get a folio from a pte, which will
      eventually allow them to completely replace their struct page variables
      with struct folio instead.
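
      The wrapper is roughly of this shape (see mm/memory.c for the actual
      definition):

        struct folio *vm_normal_folio(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
        {
                struct page *page = vm_normal_page(vma, addr, pte);

                if (page)
                        return page_folio(page);
                return NULL;
        }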
      
      Link: https://lkml.kernel.org/r/20221221180848.20774-1-vishal.moola@gmail.com
      Link: https://lkml.kernel.org/r/20221221180848.20774-2-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: refine mab_calc_split function · e11cb683
      Vernon Yang authored
      
      
      Invert the conditional judgment of the mid_split so that the return
      statement becomes the last statement of the function, which is easier
      to understand and improves readability.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-8-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: refine ma_state init from mas_start() · 46b34584
      Vernon Yang authored
      
      
      If mas->node is an MAS_START, there are three cases, and they all
      assign different values to mas->node and mas->offset.  So there is no
      need to set them to a default value before updating them.

      Update them directly to make the code easier to understand and to
      improve readability.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-7-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: remove the redundant code · eabb3052
      Vernon Yang authored
      
      
      The macro CONFIG_DEBUG_MAPLE_TREE_VERBOSE is used by no one, and the
      functions mas_dup_tree() and mas_dup_store() are only declared, not
      implemented, so drop them.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-6-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: use macro MA_ROOT_PARENT instead of number · 84fd3e1e
      Vernon Yang authored
      
      
      When checking whether node->parent is the parent of the root node,
      using the macro MA_ROOT_PARENT is easier to understand and improves
      readability.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-5-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: use mt_node_max() instead of direct operations mt_max[] · bd592703
      Vernon Yang authored
      
      
      Use mt_node_max() to get the maximum number of slots for a node
      rather than operating on mt_max[] directly, which improves
      portability.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-4-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: remove extra return statement · d56c593c
      Vernon Yang authored
      
      
      For functions with a return type of void, it is unnecessary to add a
      return statement at the end of the function, so drop it.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-3-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: remove extra space and blank line · 831978e3
      Vernon Yang authored
      
      
      Patch series "Clean up and refinement for maple tree", v2.
      
      This patchset cleans up and refines some maple tree code.  A few small
      changes make the code easier to understand and for better readability.
      
      
      This patch (of 7):
      
      These extra spaces and blank lines are unnecessary, so drop them.
      
      Link: https://lkml.kernel.org/r/20221221060058.609003-1-vernon2gm@gmail.com
      Link: https://lkml.kernel.org/r/20221221060058.609003-2-vernon2gm@gmail.com
      Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node() · 80b1d8fd
      Lorenzo Stoakes authored
      
      
      This function sets __GFP_NOWARN in the gfp_mask, rendering the
      warn_alloc() invocations no-ops.  Remove this and instead rely on the
      flag being set only for the vm_area_alloc_pages() call, ensuring it
      is cleared for each of the warn_alloc() calls.
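
      A simplified sketch of the intent (an approximation, not the exact
      vmalloc.c code):

        /* add __GFP_NOWARN only for the bulk page allocation itself ... */
        area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN, node,
                                             page_order, nr_small_pages,
                                             area->pages);

        /* ... so the failure paths still warn with the caller's mask */
        if (area->nr_pages != nr_small_pages)
                warn_alloc(gfp_mask, NULL,
                           "vmalloc error: failed to allocate pages");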
      
      Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/vm/page_owner_sort: free memory before exit · ef1faf0e
      Jianlin Lv authored
      
      
      Although the kernel removes the memory associated with a process when
      that process terminates, it is neither good style nor proper design
      to leave the cleanup to the kernel.  This patch frees the allocated
      memory before the process exits.
      
      Link: https://lkml.kernel.org/r/20221219164917.14132-1-iecedge@gmail.com
      Signed-off-by: Jianlin Lv <iecedge@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: allow sampling page_alloc allocations for HW_TAGS · 44383cef
      Andrey Konovalov authored
      
      
      As Hardware Tag-Based KASAN is intended to be used in production, its
      performance impact is crucial.  As page_alloc allocations tend to be big,
      tagging and checking all such allocations can introduce a significant
      slowdown.
      
      Add two new boot parameters that allow alleviating that slowdown:
      
      - kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
        every Nth page_alloc allocation with the order configured by the second
        added parameter (default: tag every such allocation).
      
      - kasan.page_alloc.sample.order, which makes sampling enabled by the first
        parameter only affect page_alloc allocations with the order equal to or
        greater than the specified value (default: 3, see below).
      
      The exact performance improvement caused by using the new parameters
      depends on their values and the applied workload.
      
      The chosen default value for kasan.page_alloc.sample.order is 3, which
      matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER.  This is
      done for two reasons:
      
      1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
         costly to service", which corresponds to the idea that only large and
         thus costly allocations are supposed to be sampled.
      
      2. One of the workloads targeted by this patch is a benchmark that sends
         a large amount of data over a local loopback connection. Most multi-page
         data allocations in the networking subsystem have the order of
         SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).
      
      When running a local loopback test on a testing MTE-enabled device in
      sync mode, enabling Hardware Tag-Based KASAN introduces a ~50%
      slowdown.  Applying this patch and setting kasan.page_alloc.sample to
      a value higher than 1 allows lowering that slowdown.  The performance
      improvement saturates around the sampling interval value of 10 with
      the default sampling page order of 3.  This lowers the slowdown to
      ~20%.  The slowdown in real scenarios involving the network will
      likely be smaller.
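
      For example, booting with "kasan.page_alloc.sample=10
      kasan.page_alloc.sample.order=3" on the kernel command line should
      tag roughly every 10th page_alloc allocation of order 3 or higher;
      the exact values here are only an illustration of the two parameters
      described above.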
      
      Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
      a page_alloc allocation that has not been tagged.  This lowers the value
      of KASAN as a security mitigation.
      
      However, based on measuring the number of page_alloc allocations of
      different orders during boot in a test build, sampling with the default
      kasan.page_alloc.sample.order value affects only ~7% of allocations.
      The remaining ~93% of allocations are still checked deterministically.
      
      Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Brand <markbrand@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swap: avoid holding swap reference in swap_cache_get_folio · cbc2bd98
      Kairui Song authored
      
      
      All its callers either already hold a reference to, or lock, the swap
      device while calling this function.  The only exception is
      shmem_swapin_folio; just make this caller also hold a reference to
      the swap device, so this helper can be simplified and save a few
      cycles.
      
      This also provides finer control of error handling in
      shmem_swapin_folio: on a race (with swapoff), it can just try again;
      for an invalid swap entry, it can fail with a proper error code.
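
      The resulting flow in shmem_swapin_folio() is roughly the following
      (simplified sketch, not the exact code):

        si = get_swap_device(swap);
        if (!si) {
                /* invalid entry: fail with an error; swapoff race: retry */
                goto failed;
        }

        folio = swap_cache_get_folio(swap, NULL, 0);
        /* ... the rest of the swap-in path ... */
        put_swap_device(si);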
      
      Link: https://lkml.kernel.org/r/20221219185840.25441-5-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swap: fold swap_ra_clamp_pfn into swap_ra_info · 16ba391e
      Kairui Song authored
      
      
      This makes the code cleaner.  The helper consists of only two lines
      of self-explanatory code and is not reused anywhere else.

      This also makes the compiled object slightly smaller.
      
      bloat-o-meter results on x86_64 of mm/swap_state.o:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-35 (-35)
      Function                                     old     new   delta
      swap_ra_info.constprop                       512     477     -35
      Total: Before=8388, After=8353, chg -0.42%
      
      Link: https://lkml.kernel.org/r/20221219185840.25441-4-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swap: avoid a redundant pte map if ra window is 1 · 18ad72f5
      Kairui Song authored
      
      
      Avoid a redundant pte map/unmap when swap readahead window is 1.
      
      Link: https://lkml.kernel.org/r/20221219185840.25441-3-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swapfile: get rid of volatile and avoid redundant read · 3f79b187
      Kairui Song authored
      
      
      Patch series "Clean up and fixes for swap", v2.
      
      This series cleans up some code paths, saves a few cycles, and
      reduces the object size by a bit.  It also fixes a rare race issue
      with statistics.
      
      
      This patch (of 4):
      
      Convert a volatile variable to the more readable READ_ONCE().  This
      also prevents the code from redundantly reading the variable twice
      when it races with an update.
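
      A generic illustration of the pattern (hypothetical variable names,
      not the actual swapfile.c code):

        /* before: 'volatile' qualifier, and the value may be read twice */
        if (some_counter && some_counter < limit)
                use(some_counter);

        /* after: take one explicit snapshot with READ_ONCE() */
        unsigned int n = READ_ONCE(some_counter);

        if (n && n < limit)
                use(n);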
      
      Link: https://lkml.kernel.org/r/20221219185840.25441-1-ryncsn@gmail.com
      Link: https://lkml.kernel.org/r/20221219185840.25441-2-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/ABI/damon: document scheme filters files · 497b099d
      SeongJae Park authored
      
      
      Document the newly added DAMON sysfs interface files for DAMOS
      filtering in the DAMON ABI document.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-12-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs · 9b7f9322
      SeongJae Park authored
      
      
      Document the newly added files for DAMOS filters in the DAMON usage
      document.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-11-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon/sysfs: test filters directory · 553b0142
      SeongJae Park authored
      
      
      Add simple test cases for the scheme filters of the DAMON sysfs
      interface.  The test cases check whether the files are populated as
      expected, receive valid inputs, and refuse invalid inputs.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: implement scheme filters · 29cbb9a1
      SeongJae Park authored
      
      
      Implement the scheme filters functionality of the DAMON sysfs
      interface by making the code read the values of the files under the
      filter directories and pass them to DAMON using the DAMON kernel API.
      
      [sj@kernel.org: fix leaking a filter for wrong cgroup path]
        Link: https://lkml.kernel.org/r/20221219171807.55708-2-sj@kernel.org
      [sj@kernel.org: return an error for filter memcg path id lookup failure]
        Link: https://lkml.kernel.org/r/20221219171807.55708-3-sj@kernel.org
      Link: https://lkml.kernel.org/r/20221205230830.144349-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: connect filter directory and filters directory · 472e2b70
      SeongJae Park authored
      
      
      Implement the 'nr_filters' file under the 'filters' directory, which
      will be used to populate a specific number of 'filter' directories
      under that directory, similar to other 'nr_*' files in the DAMON
      sysfs interface.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: implement filter directory · 7ee161f1
      SeongJae Park authored
      
      
      Implement the DAMOS filter directory, which will be located under the
      filters directory.  The directory provides three files, namely type,
      matching, and memcg_path.  'type' and 'matching' will be directly
      connected to the fields of 'struct damos_filter' having the same
      names.  'memcg_path' will receive the path of the memory cgroup of
      interest and will later be converted to a memcg id when it is
      committed.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: implement filters directory · ac35264b
      SeongJae Park authored
      
      
      DAMOS filters are currently supported only by the DAMON kernel API.
      To expose the feature to user space, implement a DAMON sysfs
      directory named 'filters' under each scheme directory.  Please note
      that this implements only the directory.  Following commits will
      implement more files and directories, and finally connect the DAMOS
      filters feature.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/damon/reclaim: document 'skip_anon' parameter · d56fe242
      SeongJae Park authored
      
      
      Document the newly added 'skip_anon' parameter of DAMON_RECLAIM,
      which can be used to avoid reclamation of anonymous pages.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/reclaim: add a parameter called skip_anon for avoiding anonymous pages reclamation · 66d9faec
      SeongJae Park authored
      
      
      In some cases, for example if users are confident in their anonymous
      page management or the swap device is too slow, users would want to
      prevent DAMON_RECLAIM from swapping anonymous pages out.  For such
      cases, add yet another DAMON_RECLAIM parameter, namely 'skip_anon'.
      When it is set to 'Y', DAMON_RECLAIM will avoid reclaiming anonymous
      pages using a DAMOS filter.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/paddr: support DAMOS filters · 18250e78
      SeongJae Park authored
      
      
      Implement support for DAMOS filters in the physical address space
      monitoring operations set, for all DAMOS actions that it supports,
      including 'pageout', 'lru_prio', and 'lru_deprio'.
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: implement damos filter · 98def236
      SeongJae Park authored
      
      
      Patch series "implement DAMOS filtering for anon pages and/or specific
      memory cgroups"
      
      DAMOS lets users do system operations in a data access pattern
      oriented way.  The data access pattern, which is extracted by DAMON,
      is in many cases somewhat more accurate than what user space could
      know.  However, in some situations, users could know more than the
      kernel about the pattern, or have special requirements for some types
      of memory or processes.  For example, some users may have slow swap
      devices and know their latency-critical processes, and therefore want
      to use DAMON-based proactive reclamation (DAMON_RECLAIM) for only
      non-anonymous pages of non-latency-critical processes.
      
      For such restrictions, users could exclude the memory regions from
      the initial monitoring regions and use a monitoring operations set
      with non-dynamic monitoring regions updates, such as fvaddr and
      paddr.  They could also adjust the DAMOS target access pattern.  For
      dynamically changing memory layouts and access patterns, however,
      those would not be enough.
      
      To help with such cases, add an interface, namely DAMOS filters, to
      the DAMON kernel API (damon.h).  It can be used to prevent DAMOS
      actions from being applied to specific types of memory.  At the
      moment, it supports filtering anonymous pages and/or specific memory
      cgroups in or out for each DAMOS scheme.
      
      This patchset adds the support for all DAMOS actions that the 'paddr'
      monitoring operations set supports ('pageout', 'lru_prio', and
      'lru_deprio'), and the functionality is exposed via the DAMON kernel
      API (damon.h), the DAMON sysfs interface
      (/sys/kernel/mm/damon/admins/), and DAMON_RECLAIM module parameters.
      
      Patches Sequence
      ----------------
      
      The first patch implements the DAMOS filter interface in the DAMON
      kernel API.  The second patch makes the physical address space
      monitoring operations set support the filters for all DAMOS actions
      that it supports.  The third patch adds anonymous page filter support
      to DAMON_RECLAIM, and the fourth patch documents DAMON_RECLAIM's new
      feature.  The fifth to seventh patches implement DAMON sysfs files
      for support of the filters, and the eighth patch connects the files
      to the DAMOS filters feature.  The ninth patch adds simple self-test
      cases for DAMOS filters of the sysfs interface.  Finally, the
      following two patches (tenth and eleventh) document the new features
      and interfaces.
      
      
      This patch (of 11):
      
      DAMOS lets users do system operations in a data access pattern
      oriented way.  The data access pattern, which is extracted by DAMON,
      is in many cases somewhat more accurate than what user space could
      know.  However, in some situations, users could know more than the
      kernel about the pattern, or have special requirements for some types
      of memory or processes.  For example, some users may have slow swap
      devices and know their latency-critical processes, and therefore want
      to use DAMON-based proactive reclamation (DAMON_RECLAIM) for only
      non-anonymous pages of non-latency-critical processes.

      For such restrictions, users could exclude the memory regions from
      the initial monitoring regions and use a monitoring operations set
      with non-dynamic monitoring regions updates, such as fvaddr and
      paddr.  They could also adjust the DAMOS target access pattern.  For
      dynamically changing memory layouts and access patterns, however,
      those would not be enough.

      To help with such cases, add an interface, namely DAMOS filters, to
      the DAMON kernel API (damon.h).  It can be used to prevent DAMOS
      actions from being applied to specific types of memory.  At the
      moment, it supports filtering anonymous pages and/or specific memory
      cgroups in or out for each DAMOS scheme.
      
      Note that this commit adds only the interface to the DAMON kernel
      API.  The implementation should be made in the monitoring operations
      sets, and following commits will add it.
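
      A usage sketch of the new kernel API, as it is used later in the
      series by DAMON_RECLAIM (treat the details as an approximation):

        struct damos_filter *filter;

        /* filter out (matching == true) anonymous pages from this scheme */
        filter = damos_new_filter(DAMOS_FILTER_TYPE_ANON, true);
        if (!filter)
                return -ENOMEM;
        damos_add_filter(scheme, filter);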
      
      Link: https://lkml.kernel.org/r/20221205230830.144349-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20221205230830.144349-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: deprecate charge moving · da34a848
      Johannes Weiner authored
      
      
      Charge moving mode in cgroup1 allows memory to follow tasks as they
      migrate between cgroups.  This is, and always has been, a questionable
      thing to do - for several reasons.
      
      First, it's expensive.  Pages need to be identified, locked and isolated
      from various MM operations, and reassigned, one by one.
      
      Second, it's unreliable.  Once pages are charged to a cgroup, there isn't
      always a clear owner task anymore.  Cache isn't moved at all, for example.
      Mapped memory is moved - but if trylocking or isolating a page fails,
      it's arbitrarily left behind.  Frequent moving between domains may leave a
      task's memory scattered all over the place.
      
      Third, it isn't really needed.  Launcher tasks can kick off workload tasks
      directly in their target cgroup.  Using dedicated per-workload groups
      allows fine-grained policy adjustments - no need to move tasks and their
      physical pages between control domains.  The feature was never
      forward-ported to cgroup2, and it hasn't been missed.
      
      Despite it being a niche usecase, the maintenance overhead of supporting
      it is enormous.  Because pages are moved while they are live and subject
      to various MM operations, the synchronization rules are complicated. 
      There are lock_page_memcg() in MM and FS code, which non-cgroup people
      don't understand.  In some cases we've been able to shift code and cgroup
      API calls around such that we can rely on native locking as much as
      possible.  But that's fragile, and sometimes we need to hold MM locks for
      longer than we otherwise would (pte lock e.g.).
      
      Mark the feature deprecated. Hopefully we can remove it soon.
      
      And backport into -stable kernels so that people who develop against
      earlier kernels are warned about this deprecation as early as possible.
      
      [akpm@linux-foundation.org: fix memory.rst underlining]
      Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>