  1. Aug 08, 2020
    • mm: memcontrol: restore proper dirty throttling when memory.high changes · 19ce33ac
      Johannes Weiner authored
      Commit 8c8c383c ("mm: memcontrol: try harder to set a new
      memory.high") inadvertently removed a callback to recalculate the
      writeback cache size in light of a newly configured memory.high limit.
      
      Without letting the writeback cache know about a potentially heavily
      reduced limit, it may permit too many dirty pages, which can cause
      unnecessary reclaim latencies or even avoidable OOM situations.
      
      This was spotted while reading the code; it hasn't knowingly caused any
      problems in practice so far.
      
      Fixes: 8c8c383c ("mm: memcontrol: try harder to set a new memory.high")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, oom: check memcg margin for parallel oom · 1378b37d
      Yafang Shao authored
      
      
      Memcg oom killer invocation is synchronized by the global oom_lock, and
      tasks sleep on the lock while somebody is selecting the victim, or they
      potentially race with the oom_reaper releasing the victim's memory.
      This can result in a pointless oom killer invocation because a waiter
      might be racing with the oom_reaper:
      
              P1              oom_reaper              P2
                              oom_reap_task           mutex_lock(oom_lock)
                                                      out_of_memory # no victim because we have one already
                              __oom_reap_task_mm      mutex_unlock(oom_lock)
       mutex_lock(oom_lock)
                              set MMF_OOM_SKIP
       select_bad_process
       # finds a new victim
      
      The page allocator prevents this race by trying the allocation again once
      the lock can be acquired (in __alloc_pages_may_oom), which acts as a last
      minute check.  Moreover, the page allocator doesn't block on the oom_lock
      and simply retries the whole reclaim process.
      
      The memcg oom killer should do the same last minute check.  Call
      mem_cgroup_margin to do that.  A trylock on the oom_lock could be done as
      well, but that doesn't seem to be necessary at this stage.
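      
      A minimal sketch of the check described above, folded into
      mem_cgroup_out_of_memory() (the helpers exist in the kernel; the
      surrounding body is abbreviated and illustrative, not the literal diff):
      
          static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
                                               int order)
          {
                  bool ret = true;
      
                  if (mutex_lock_killable(&oom_lock))
                          return true;
      
                  /*
                   * A waiter that slept on oom_lock may have raced with the
                   * oom_reaper or a parallel charge; if enough memory has been
                   * freed in the meantime, skip the kill.
                   */
                  if (mem_cgroup_margin(memcg) >= (1 << order))
                          goto unlock;
      
                  /* ... set up oom_control and call out_of_memory() ... */
          unlock:
                  mutex_unlock(&oom_lock);
                  return ret;
          }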
      
      [mhocko@kernel.org: commit log]
      
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: decouple e{low,min} state mutations from protection checks · 45c7f7e1
      Chris Down authored
      
      
      mem_cgroup_protected currently is both used to set effective low and min
      and return a mem_cgroup_protection based on the result.  As a user, this
      can be a little unexpected: it appears to be a simple predicate function,
      if not for the big warning in the comment above about the order in which
      it must be executed.
      
      This change makes it so that we separate the state mutations from the
      actual protection checks, which makes it more obvious where we need to be
      careful mutating internal state, and where we are simply checking and
      don't need to worry about that.
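      
      Schematically, the split looks like this (helper names as introduced by
      this change; the bodies are condensed and error/root handling is
      omitted):
      
          /* State mutation: walk from the reclaim root and set effective low/min. */
          void mem_cgroup_calculate_protection(struct mem_cgroup *root,
                                               struct mem_cgroup *memcg);
      
          /* Pure checks: no side effects, can be called in any order afterwards. */
          static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
          {
                  return READ_ONCE(memcg->memory.elow) >=
                          page_counter_read(&memcg->memory);
          }
      
          static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
          {
                  return READ_ONCE(memcg->memory.emin) >=
                          page_counter_read(&memcg->memory);
          }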
      
      [mhocko@suse.com - don't check protection on root memcgs]
      
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: avoid stale protection values when cgroup is above protection · 22f7496f
      Yafang Shao authored
      Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.
      
      This series contains a fix for an edge case in my earlier protection
      calculation patches, and a patch to make the area overall a little more
      robust, to hopefully help avoid this in the future.
      
      This patch (of 2):
      
      A cgroup can have both memory protection and a memory limit to isolate it
      from its siblings in both directions - for example, to prevent it from
      being shrunk below 2G under high pressure from outside, but also from
      growing beyond 4G under low pressure.
      
      Commit 9783aa99 ("mm, memcg: proportional memory.{low,min} reclaim")
      implemented proportional scan pressure so that multiple siblings in excess
      of their protection settings don't get reclaimed equally but instead in
      accordance to their unprotected portion.
      
      During limit reclaim, this proportionality shouldn't apply of course:
      there is no competition, all pressure is from within the cgroup and should
      be applied as such.  Reclaim should operate at full efficiency.
      
      However, mem_cgroup_protected() never expected anybody to look at the
      effective protection values when it indicated that the cgroup is above its
      protection.  As a result, a query during limit reclaim may return stale
      protection values that were calculated by a previous reclaim cycle in
      which the cgroup did have siblings.
      
      When this happens, reclaim is unnecessarily hesitant and potentially slow
      to meet the desired limit.  In theory this could lead to premature OOM
      kills, although it's not obvious this has occurred in practice.
      
      Work around the problem by special-casing reclaim roots in
      mem_cgroup_protection.  These memcgs never participate in reclaim
      protection because the reclaim is internal.
      
      We have to ignore effective protection values for reclaim roots because
      mem_cgroup_protected might be called from racing reclaim contexts with
      different roots.  The calculation relies on a root -> leaf tree traversal,
      so top-down reclaim protection invariants should hold.  The only exception
      is the reclaim root, which should have effective protection set to 0, but
      that would be problematic for the following setup:
      
       Let's have global and A's reclaim in parallel:
        |
        A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
        |\
        | C (low = 1G, usage = 2.5G)
        B (low = 1G, usage = 0.5G)
      
       for A reclaim we have
       B.elow = B.low
       C.elow = C.low
      
       For the global reclaim
       A.elow = A.low
       B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
       C.elow = min(C.usage, C.low)
      
       With the effective values resetting we have A reclaim
       A.elow = 0
       B.elow = B.low
       C.elow = C.low
      
       and global reclaim could see the above and then
       B.elow = C.elow = 0 because children_low_usage > A.elow
      
      Which means that protected memcgs would get reclaimed.
      
      In the future we would like to make mem_cgroup_protected more robust
      against racing reclaim contexts, but that is likely a more complex
      solution than this simple workaround.
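      
      A condensed sketch of the workaround in mem_cgroup_protection() (the
      root == memcg check is the new part; the rest is abbreviated):
      
          static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
                                                            struct mem_cgroup *memcg,
                                                            bool in_low_reclaim)
          {
                  if (mem_cgroup_disabled())
                          return 0;
      
                  /*
                   * The reclaim root has no competition, so any e{low,min}
                   * left over from a previous reclaim cycle must be ignored.
                   */
                  if (root == memcg)
                          return 0;
      
                  if (in_low_reclaim)
                          return READ_ONCE(memcg->memory.emin);
      
                  return max(READ_ONCE(memcg->memory.emin),
                             READ_ONCE(memcg->memory.elow));
          }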
      
      [hannes@cmpxchg.org - large part of the changelog]
      [mhocko@suse.com - workaround explanation]
      [chris@chrisdown.name - retitle]
      
      Fixes: 9783aa99 ("mm, memcg: proportional memory.{low,min} reclaim")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: unify reclaim retry limits with page allocator · d977aa93
      Chris Down authored
      Reclaim retries have been set to 5 since the beginning of time in
      commit 66e1707b ("Memory controller: add per cgroup LRU and
      reclaim").  However, we now have a generally agreed-upon standard for
      page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
      in commit 0a0337e0 ("mm, oom: rework oom detection").
      
      In the absence of a compelling reason to declare an OOM earlier in memcg
      context than page allocator context, it seems reasonable to supplant
      MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
      allocator and memcg internals more similar in semantics when reclaim
      fails to produce results, avoiding premature OOMs or throttling.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: reclaim more aggressively before high allocator throttling · b3ff9291
      Chris Down authored
      
      
      Patch series "mm, memcg: reclaim harder before high throttling", v2.
      
      This patch (of 2):
      
      In Facebook production, we've seen cases where cgroups have been put into
      allocator throttling even when they appear to have a lot of slack file
      caches which should be trivially reclaimable.
      
      Looking more closely, the problem is that we only try a single cgroup
      reclaim walk for each return to usermode before calculating whether or not
      we should throttle.  This single attempt doesn't produce enough pressure
      to shrink for cgroups with a rapidly growing amount of file caches prior
      to entering allocator throttling.
      
      As an example, we see that threads in an affected cgroup are stuck in
      allocator throttling:
      
          # for i in $(cat cgroup.threads); do
          >     grep over_high "/proc/$i/stack"
          > done
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
      
      ...however, there is no I/O pressure reported by PSI, despite a lot of
      slack file pages:
      
          # cat memory.pressure
          some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
          full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
          # cat io.pressure
          some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
          full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
          # grep _file memory.stat
          inactive_file 1370939392
          active_file 661635072
      
      This patch changes the behaviour to retry reclaim either until the current
      task goes below the 10ms grace period, or we are making no reclaim
      progress at all.  In the latter case, we enter reclaim throttling as
      before.
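      
      Roughly, the resulting flow in mem_cgroup_handle_over_high() (a sketch,
      not the literal diff; setup and the PSI annotations are omitted):
      
          retry_reclaim:
                  nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
      
                  /* Under the ~10ms grace period: no need to throttle at all. */
                  penalty_jiffies = calculate_high_delay(memcg, nr_pages);
                  if (penalty_jiffies <= HZ / 100)
                          goto out;
      
                  /* Keep reclaiming as long as we make any forward progress. */
                  if (nr_reclaimed || nr_retries--)
                          goto retry_reclaim;
      
                  /* No progress at all: fall back to allocator throttling. */
                  schedule_timeout_killable(penalty_jiffies);
          out:
                  css_put(&memcg->css);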
      
      To a user, there's no intuitive reason for the reclaim behaviour to differ
      from hitting memory.high as part of a new allocation, as opposed to
      hitting memory.high because someone lowered its value.  As such this also
      brings an added benefit: it unifies the reclaim behaviour between the two.
      
      There's precedent for this behaviour: we already do reclaim retries when
      writing to memory.{high,max}, in max reclaim, and in the page allocator
      itself.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: avoid workload stalls when lowering memory.high · 536d3bf2
      Roman Gushchin authored
      
      
      The memory.high limit is implemented in such a way that the kernel
      penalizes all threads which are allocating memory over the limit.  Forcing
      all threads into synchronous reclaim and adding some artificial delays
      allows the kernel to slow down the memory consumption and potentially
      gives userspace oom handlers/resource control agents some time to react.
      
      It works nicely if the memory usage is hitting the limit from below,
      but it works suboptimally if a user adjusts memory.high to a value way
      below the current memory usage.  It basically forces all workload threads
      (doing any memory allocations) into synchronous reclaim and sleep.
      This makes the workload completely unresponsive for a long period of time
      and can also lead to system-wide contention on lru locks.  It can happen
      even if the workload is not actually tight on memory and has, for example,
      a ton of cold pagecache.
      
      In the current implementation writing to memory.high causes an atomic
      update of page counter's high value followed by an attempt to reclaim
      enough memory to fit into the new limit.  To fix the problem described
      above, all we need is to change the order of execution: try to push the
      memory usage under the limit first, and only then set the new high limit.
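      
      In terms of memory_high_write(), the fix is essentially a reordering:
      reclaim towards the new target first, and publish the limit afterwards
      (a sketch; parsing of the new limit and stock draining are omitted):
      
          unsigned int nr_retries = MAX_RECLAIM_RETRIES;
      
          /* Push usage towards the new target first ... */
          for (;;) {
                  unsigned long nr_pages = page_counter_read(&memcg->memory);
                  unsigned long reclaimed;
      
                  if (nr_pages <= high)
                          break;
                  if (signal_pending(current))
                          break;
      
                  reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                                           GFP_KERNEL, true);
                  if (!reclaimed && !nr_retries--)
                          break;
          }
      
          /* ... and only then set the new high limit. */
          page_counter_set_high(&memcg->memory, high);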
      
      Reported-by: Domas Mituzas <domas@fb.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kmem: switch to static_branch_likely() in memcg_kmem_enabled() · eda330e5
      Roman Gushchin authored
      
      
      Currently memcg_kmem_enabled() is optimized for the kernel memory
      accounting being off.  It was so for a long time, and arguably the reason
      behind was that the kernel memory accounting was initially an opt-in
      feature.  However, now it's on by default on both cgroup v1 and cgroup v2,
      and it's on for all cgroups.  So let's switch over to
      static_branch_likely() to reflect this fact.
      
      It is unlikely that there is a significant performance difference, as the
      cost of a memory allocation and its accounting significantly exceeds the
      cost of a jump.  However, the conversion makes the code look more logical.
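      
      The conversion itself is a one-liner in memcg_kmem_enabled(); roughly:
      
          static inline bool memcg_kmem_enabled(void)
          {
                  /* kmem accounting is on by default, so expect the key to be enabled. */
                  return static_branch_likely(&memcg_kmem_enabled_key);
          }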
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() · 74d555be
      Roman Gushchin authored
      
      
      charge_slab_page() and uncharge_slab_page() are not related anymore to
      memcg charging and uncharging.  In order to make their names less
      confusing, let's rename them to account_slab_page() and
      unaccount_slab_page() respectively.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: remove unused argument by charge_slab_page() · 84950480
      Roman Gushchin authored
      
      
      charge_slab_page() is not using the gfp argument anymore,
      remove it.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: account kernel stack per node · 991e7673
      Shakeel Butt authored
      
      
      Currently the kernel stack is being accounted per-zone.  There is no need
      to do that.  In addition due to being per-zone, memcg has to keep a
      separate MEMCG_KERNEL_STACK_KB.  Make the stat per-node and deprecate
      MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
      node_stat_item.  In addition localize the kernel stack stats updates to
      account_kernel_stack().
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/cgroup: add memcg_slabinfo.py tool · fbc1ac9d
      Roman Gushchin authored
      
      
      Add a drgn-based tool to display slab information for a given memcg.  It
      can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2, but
      in a more flexible way.
      
      Currently supports only SLUB configuration, but SLAB can be trivially
      added later.
      
      Output example:
      $ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
      shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
      eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
      eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
      kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
      mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
      signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
      sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
      files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
      task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
      task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
      radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
      btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
      kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
      kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
      inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
      kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
      skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
      sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
      cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
      proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
      dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
      filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
      anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
      pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
      vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-20-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kselftests: cgroup: add kernel memory accounting tests · 933dc80e
      Roman Gushchin authored
      
      
      Add some tests to cover the kernel memory accounting functionality.  These
      are covering some issues (and changes) we had recently.
      
      1) A test which allocates a lot of negative dentries, checks memcg slab
         statistics, creates memory pressure by setting memory.max to some low
         value and checks that some number of slabs was reclaimed.
      
      2) A test which covers side effects of memcg destruction: it creates
         and destroys a large number of sub-cgroups, each containing a
         multi-threaded workload which allocates and releases some kernel
         memory.  Then it checks that the charge and memory.stat do add up on
         the parent level.
      
      3) A test which reads /proc/kpagecgroup and implicitly checks that it
         doesn't crash the system.
      
      4) A test which spawns a large number of threads and checks that the
         kernel stacks accounting works as expected.
      
      5) A test which checks that living charged slab objects are not
         preventing the memory cgroup from being released after being deleted by
         a user.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-19-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: use a single set of kmem_caches for all allocations · 10befea9
      Roman Gushchin authored
      
      
      Instead of having two sets of kmem_caches: one for system-wide and
      non-accounted allocations and the second one shared by all accounted
      allocations, we can use just one.
      
      The idea is simple: space for obj_cgroup metadata can be allocated on
      demand and filled only for accounted allocations.
      
      It allows us to remove a bunch of code which is required to handle
      kmem_cache clones for accounted allocations.  There is no more need to
      create them, accumulate statistics, propagate attributes, etc.  It's quite
      a significant simplification.
      
      Also, because the total number of slab_caches is almost halved (not
      all kmem_caches have a memcg clone), some additional memory savings are
      expected.  On my devvm it additionally saves about 3.5% of slab memory.
      
      [guro@fb.com: fix build on MIPS]
        Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com
      
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() · 15999eef
      Roman Gushchin authored
      
      
      memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
      a first argument, so the is_root_cache(s) check is redundant and can be
      removed without any functional change.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-17-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: deprecate slab_root_caches · c7094406
      Roman Gushchin authored
      
      
      Currently there are two lists of kmem_caches:
      1) slab_caches, which contains all kmem_caches,
      2) slab_root_caches, which contains only root kmem_caches.
      
      And there is some preprocessor magic to have a single list if
      CONFIG_MEMCG_KMEM isn't enabled.
      
      It was required earlier because the number of non-root kmem_caches was
      proportional to the number of memory cgroups and could reach really big
      values.  Now, when it cannot exceed the number of root kmem_caches, there
      is really no reason to maintain two lists.
      
      We never iterate over the slab_root_caches list on any hot paths, so it's
      perfectly fine to iterate over slab_caches and filter out non-root
      kmem_caches.
      
      This allows us to remove a lot of config-dependent code and two pointers
      from the kmem_cache structure.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: remove memcg_kmem_get_cache() · 272911a4
      Roman Gushchin authored
      
      
      The memcg_kmem_get_cache() function became really trivial, so let's just
      inline it into the single call point: memcg_slab_pre_alloc_hook().
      
      It will make the code less bulky and can also help the compiler generate
      better code.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: simplify memcg cache creation · d797b7d0
      Roman Gushchin authored
      
      
      Because the number of non-root kmem_caches doesn't depend on the number of
      memory cgroups anymore and is generally not very big, there is no more
      need for a dedicated workqueue.
      
      Also, as there is no more need to pass any arguments to the
      memcg_create_kmem_cache() except the root kmem_cache, it's possible to
      just embed the work structure into the kmem_cache and avoid the dynamic
      allocation of the work structure.
      
      This will also simplify the synchronization: for each root kmem_cache
      there is only one work.  So there will be no more concurrent attempts to
      create a non-root kmem_cache for a root kmem_cache: the second and all
      following attempts to queue the work will fail.
      
      On the kmem_cache destruction path there is no more need to call the
      expensive flush_workqueue() and wait for all pending works to be finished.
      Instead, cancel_work_sync() can be used to cancel/wait for only one work.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: use a single set of kmem_caches for all accounted allocations · 9855609b
      Roman Gushchin authored
      
      
      This is a fairly big but mostly red patch, which makes all accounted slab
      allocations use a single set of kmem_caches instead of creating a separate
      set for each memory cgroup.
      
      Because the number of non-root kmem_caches is now capped by the number of
      root kmem_caches, there is no need to shrink or destroy them prematurely.
      They can be perfectly well destroyed together with their root counterparts.
      This allows us to dramatically simplify the management of non-root
      kmem_caches and delete a ton of code.
      
      This patch performs the following changes:
      1) introduces memcg_params.memcg_cache pointer to represent the
         kmem_cache which will be used for all non-root allocations
      2) reuses the existing memcg kmem_cache creation mechanism
         to create memcg kmem_cache on the first allocation attempt
      3) memcg kmem_caches are named <kmemcache_name>-memcg,
         e.g. dentry-memcg
      4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
         or schedule its creation and return the root cache
      5) removes almost all non-root kmem_cache management code
         (separate refcounter, reparenting, shrinking, etc)
      6) makes slab debugfs display the root_mem_cgroup css id and never
         show :dead and :deact flags in the memcg_slabinfo attribute.
      
      Following patches in the series will simplify the kmem_cache creation.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h · 0f876e4d
      Roman Gushchin authored
      
      
      To make the memcg_kmem_bypass() function available outside of
      memcontrol.c, let's move it to memcontrol.h.  The function is small and
      fits nicely among the static inline functions there.
      
      It will be used from the slab code.
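      
      For reference, the helper being moved looks roughly like this:
      
          static inline bool memcg_kmem_bypass(void)
          {
                  if (in_interrupt())
                          return true;
      
                  /* Allow remote memcg charging in kthread contexts. */
                  if ((!current->mm || (current->flags & PF_KTHREAD)) &&
                       !current->active_memcg)
                          return true;
      
                  return false;
          }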
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: deprecate memory.kmem.slabinfo · 4330a26b
      Roman Gushchin authored
      
      
      Deprecate memory.kmem.slabinfo.
      
      An empty file will be presented if corresponding config options are
      enabled.
      
      The interface is implementation dependent, isn't present in cgroup v2, and
      is generally useful only for core mm debugging purposes.  In other words,
      it doesn't provide any value for the absolute majority of users.
      
      A drgn-based replacement can be found in tools/cgroup/memcg_slabinfo.py.
      It supports cgroup v1 and v2, mimics the memory.kmem.slabinfo output and
      also allows one to get any additional information without the need to
      recompile the kernel.
      
      If a drgn-based solution is too slow for a task, a bpf-based tracing tool
      can be used, which can easily keep track of all slab allocations belonging
      to a memory cgroup.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-11-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: charge individual slab objects instead of pages · f2fe7b09
      Roman Gushchin authored
      
      
      Switch to per-object accounting of non-root slab objects.
      
      Charging is performed using the obj_cgroup API in the pre_alloc hook.
      The obj_cgroup is charged with the size of the object and the size of the
      metadata: for now, that is the size of an obj_cgroup pointer.  If the
      amount of memory has been charged successfully, the actual allocation code
      is executed.  Otherwise, -ENOMEM is returned.
      
      In the post_alloc hook, if the actual allocation succeeded, the
      corresponding vmstats are bumped and the obj_cgroup pointer is saved.
      Otherwise, the charge is cancelled.
      
      On the free path the obj_cgroup pointer is obtained and used to uncharge
      the size of the object being released.
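      
      Schematically, the hooks described above (heavily condensed fragments;
      the real code handles object arrays, the root-cache case and the kmem
      bypass):
      
          /* pre_alloc: charge the object size plus the obj_cgroup pointer. */
          if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s)))
                  return NULL;                    /* -ENOMEM, skip the allocation */
      
          /* post_alloc: on success, remember the owning objcg for each object ... */
          off = obj_to_index(s, page, p);
          obj_cgroup_get(objcg);
          page_obj_cgroups(page)[off] = objcg;
          /* ... and bump the per-memcg/lruvec slab vmstat by obj_full_size(s). */
      
          /* free path: look up the owner and give the bytes back. */
          objcg = page_obj_cgroups(page)[off];
          page_obj_cgroups(page)[off] = NULL;
          obj_cgroup_uncharge(objcg, obj_full_size(s));
          obj_cgroup_put(objcg);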
      
      Memcg and lruvec counters are now representing only memory used by active
      slab objects and do not include the free space.  The free space is shared
      and doesn't belong to any specific cgroup.
      
      Global per-node slab vmstats are still modified from
      (un)charge_slab_page() functions.  The idea is to keep all slab pages
      accounted as slab pages on system level.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: save obj_cgroup for non-root slab objects · 964d4bd3
      Roman Gushchin authored
      
      
      Store the obj_cgroup pointer in the corresponding place of
      page->obj_cgroups for each allocated non-root slab object.  Make sure that
      each allocated object holds a reference to obj_cgroup.
      
      The objcg pointer is obtained by dereferencing memcg->objcg in
      memcg_kmem_get_cache() and is passed from pre_alloc_hook to
      post_alloc_hook.  Then, in case of successful allocation(s), it is stored
      in the page->obj_cgroups vector.
      
      The objcg-obtaining part looks a bit bulky now, but it will be simplified
      by the next commits in the series.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: allocate obj_cgroups for non-root slab pages · 286e04b8
      Roman Gushchin authored
      
      
      Allocate and release memory to store obj_cgroup pointers for each non-root
      slab page. Reuse page->mem_cgroup pointer to store a pointer to the
      allocated space.
      
      This commit temporarily increases the memory footprint of the kernel
      memory accounting.  To store the obj_cgroup pointers, we'll need a place
      for one pointer per allocated object.  However, the following patches in
      the series will enable sharing of slab pages between memory cgroups,
      which will dramatically increase the total slab utilization.  And the
      final memory footprint will be significantly smaller than before.
      
      To distinguish between obj_cgroups and memcg pointers in case when it's
      not obvious which one is used (as in page_cgroup_ino()), let's always set
      the lowest bit in the obj_cgroup case. The original obj_cgroups
      pointer is marked to be ignored by kmemleak, which otherwise would
      report a memory leak for each allocated vector.
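      
      The low-bit tagging can be sketched with helpers along these lines
      (condensed):
      
          /* page->mem_cgroup and page->obj_cgroups share one word; bit 0 tells them apart. */
          static inline bool page_has_obj_cgroups(struct page *page)
          {
                  return (unsigned long)page->obj_cgroups & 0x1UL;
          }
      
          static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
          {
                  return (struct obj_cgroup **)
                          ((unsigned long)page->obj_cgroups & ~0x1UL);
          }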
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: obj_cgroup API · bf4f0599
      Roman Gushchin authored
      
      
      The obj_cgroup API provides the ability to account sub-page-sized kernel
      objects, which potentially outlive the original memory cgroup.
      
      The top-level API consists of the following functions:
        bool obj_cgroup_tryget(struct obj_cgroup *objcg);
        void obj_cgroup_get(struct obj_cgroup *objcg);
        void obj_cgroup_put(struct obj_cgroup *objcg);
      
        int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
        void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
      
        struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
        struct obj_cgroup *get_obj_cgroup_from_current(void);
      
      Object cgroup is basically a pointer to a memory cgroup with a per-cpu
      reference counter.  It substitutes a memory cgroup in places where it's
      necessary to charge a custom amount of bytes instead of pages.
      
      All charged memory rounded down to pages is charged to the corresponding
      memory cgroup using __memcg_kmem_charge().
      
      It implements reparenting: on memcg offlining it's getting reattached to
      the parent memory cgroup.  Each online memory cgroup has an associated
      active object cgroup to handle new allocations and the list of all
      attached object cgroups.  On offlining of a cgroup this list is reparented
      and for each object cgroup in the list the memcg pointer is swapped to the
      parent memory cgroup.  This prevents long-living objects from pinning the
      original memory cgroup in memory.
      
      The implementation is based on byte-sized per-cpu stocks.  A sub-page
      sized leftover is stored in an atomic field, which is a part of obj_cgroup
      object.  So on cgroup offlining the leftover is automatically reparented.
      
      memcg->objcg is rcu protected.  objcg->memcg is a raw pointer, which is
      always pointing at a memory cgroup, but can be atomically swapped to the
      parent memory cgroup.  So a user must ensure the lifetime of the
      cgroup, e.g.  grab rcu_read_lock or css_set_lock.
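      
      For reference, the core object is roughly (field names per this series;
      condensed):
      
          struct obj_cgroup {
                  struct percpu_ref refcnt;
                  struct mem_cgroup *memcg;       /* swapped to the parent on offlining */
                  atomic_t nr_charged_bytes;      /* sub-page leftover of the per-cpu stock */
                  union {
                          struct list_head list;  /* linked into memcg->objcg_list */
                          struct rcu_head rcu;
                  };
          };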
      
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: decouple reference counting from page accounting · 1a3e1f40
      Johannes Weiner authored
      The reference counting of a memcg is currently coupled directly to how
      many 4k pages are charged to it.  This doesn't work well with Roman's new
      slab controller, which maintains pools of objects and doesn't want to keep
      an extra balance sheet for the pages backing those objects.
      
      This unusual refcounting design (reference counts usually track pointers
      to an object) is only for historical reasons: memcg used to not take any
      css references and simply stalled offlining until all charges had been
      reparented and the page counters had dropped to zero.  When we got rid of
      the reparenting requirement, the simple mechanical translation was to take
      a reference for every charge.
      
      More historical context can be found in commit e8ea14cc ("mm:
      memcontrol: take a css reference for each charged page"), commit
      64f21993 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
      commit b2052564 ("mm: memcontrol: continue cache reclaim from offlined
      groups").
      
      The new slab controller exposes the limitations in this scheme, so let's
      switch it to a more idiomatic reference counting model based on actual
      kernel pointers to the memcg:
      
      - The per-cpu stock holds a reference to the memcg it's caching
      
      - User pages hold a reference for their page->mem_cgroup. Transparent
        huge pages will no longer acquire tail references in advance, we'll
        get them if needed during the split.
      
      - Kernel pages hold a reference for their page->mem_cgroup
      
      - Pages allocated in the root cgroup will acquire and release css
        references for simplicity. css_get() and css_put() optimize that.
      
      - The current memcg_charge_slab() already hacked around the per-charge
        references; this change gets rid of that as well.
      
      - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}
      
      Roman:
      1) Rebased on top of the current mm tree: added css_get() in
         mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
      2) I've reformatted commit references in the commit log to make
         checkpatch.pl happy.
      
      [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slub: implement SLUB version of obj_to_index() · 4138fdfc
      Roman Gushchin authored
      This commit implements SLUB version of the obj_to_index() function, which
      will be required to calculate the offset of obj_cgroup in the obj_cgroups
      vector to store/obtain the objcg ownership data.
      
      To make it faster, let's repeat the SLAB's trick introduced by commit
      6a2d7a95 ("SLAB: use a multiply instead of a divide in obj_to_index()")
      and avoid an expensive division.
      
      Vlastimil Babka noticed that SLUB already has a similar function called
      slab_index(), which is defined only if SLUB_DEBUG is enabled.  The
      function does similar math, but with a division, and it also takes a
      page address instead of a page pointer.
      
      Let's remove slab_index() and replace it with the new helper
      __obj_to_index(), which takes a page address.  obj_to_index() will be a
      simple wrapper taking a page pointer and passing page_address(page) into
      __obj_to_index().
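      
      The resulting helpers, approximately:
      
          /* Multiply by a precomputed reciprocal instead of dividing by object size. */
          static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
                                                    void *addr, void *obj)
          {
                  return reciprocal_divide(obj - addr, cache->reciprocal_size);
          }
      
          static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                                  const struct page *page, void *obj)
          {
                  return __obj_to_index(cache, page_address(page), obj);
          }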
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: convert vmstat slab counters to bytes · d42f3245
      Roman Gushchin authored
      
      
      In order to prepare for per-object slab memory accounting, convert
      NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
      
      To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
      NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
      
      Internally, global and per-node counters are stored in pages, however
      memcg and lruvec counters are stored in bytes.  This scheme may look
      weird, but only for now.  As soon as slab pages are shared between
      multiple cgroups, global and node counters will reflect the total number
      of slab pages.  However, memcg and lruvec counters will be used for
      per-memcg slab memory tracking, which will account individual kernel
      objects.  Keeping global and node counters in pages helps to avoid
      additional overhead.
      
      The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
      will fit into atomic_long_t we use for vmstats.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: prepare for byte-sized vmstat items · ea426c2a
      Roman Gushchin authored
      
      
      To implement per-object slab memory accounting, we need to convert slab
      vmstat counters to bytes.  Actually, out of the 4 levels of counters
      (global, per-node, per-memcg and per-lruvec), only the last two levels
      will require byte-sized counters.  This is because global and per-node
      counters will be counting the number of slab pages, while per-memcg and
      per-lruvec counters will be counting the amount of memory taken by charged
      slab objects.
      
      Converting all vmstat counters to bytes, or even all slab counters to
      bytes, would introduce additional overhead.  So instead let's store global
      and per-node counters in pages, and memcg and lruvec counters in bytes.
      
      To make the API clean all access helpers (both on the read and write
      sides) are dealing with bytes.
      
      To avoid back-and-forth conversions a new flavor of read-side helpers is
      introduced, which always returns values in pages: node_page_state_pages()
      and global_node_page_state_pages().
      
      Actually, the new helpers just read raw values.  The old helpers are
      simple wrappers, which will complain on an attempt to read a byte value,
      because at the moment nobody actually needs bytes.
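      
      Sketch of the read-side split (condensed; the SMP clamping detail is
      omitted):
      
          /* Always returns pages, reading the raw counter. */
          unsigned long node_page_state_pages(struct pglist_data *pgdat,
                                              enum node_stat_item item)
          {
                  long x = atomic_long_read(&pgdat->vm_stat[item]);
      
                  return x < 0 ? 0 : x;
          }
      
          /* The old helper becomes a wrapper that refuses byte-sized items. */
          unsigned long node_page_state(struct pglist_data *pgdat,
                                        enum node_stat_item item)
          {
                  VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
      
                  return node_page_state_pages(pgdat, item);
          }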
      
      Thanks to Johannes Weiner for the idea of having the byte-sized API on top
      of the page-sized internal storage.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() · eedc4e5a
      Roman Gushchin authored
      
      
      Patch series "The new cgroup slab memory controller", v7.
      
      The patchset moves the accounting from the page level to the object
      level.  It allows slab pages to be shared between memory cgroups.  This
      leads to a significant win in slab utilization (up to 45%) and a
      corresponding drop in the total kernel memory footprint.  The reduced
      number of unmovable slab pages should also have a positive effect on
      memory fragmentation.
      
      The patchset makes the slab accounting code simpler: there is no more need
      for the complicated dynamic creation and destruction of per-cgroup slab
      caches; all memory cgroups use a global set of shared slab caches.  The
      lifetime of slab caches is no longer tied to the lifetime of memory
      cgroups.
      
      The more precise accounting does require more CPU; however, in practice
      the difference seems to be negligible.  We've been using the new slab
      controller in Facebook production for several months with different
      workloads and haven't seen any noticeable regressions.  What we have
      seen are memory savings on the order of 1 GB per host (varying heavily
      with the actual workload, size of RAM, number of CPUs, memory pressure,
      etc.).

      The third version of the patchset added yet another step towards
      simplifying the code: sharing of slab caches between accounted and
      non-accounted allocations.  It comes with significant upsides (most
      noticeably, a complete elimination of dynamic slab cache creation) but
      not without some regression risk, so this change sits on top of the
      patchset rather than being folded into it.  In the unlikely event of a
      noticeable performance regression it can be reverted separately.
      
      The slab memory accounting works in exactly the same way for SLAB and
      SLUB.  With both allocators the new controller shows significant memory
      savings, and the difference is bigger with SLUB.  On my 16-core desktop
      machine running Fedora 32, the amount of slab memory measured after
      system start was lower by 58% with SLUB and by 38% with SLAB.

      As an estimate of the potential CPU overhead, below are the results of
      the slab_bulk_test01 test, kindly provided by Jesper D. Brouer, who
      also helped with the evaluation of the results.

      The test can be found here: https://github.com/netoptimizer/prototype-kernel/
      The smallest number in each row should be used for the comparison.
      
      SLUB-patched - bulk-API
       - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
      
      SLUB-original -  bulk-API
       - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
      
      SLAB-patched -  bulk-API
       - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
      
      SLAB-original-  bulk-API
       - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
      
      This patch (of 19):
      
      To convert the memcg and lruvec slab counters to bytes, there must be a
      way to change these counters without touching the node counters.
      Factor __mod_memcg_lruvec_state() out of __mod_lruvec_state().
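
      A sketch of the resulting structure (a simplified userspace model with
      invented names, not the kernel code): the memcg/lruvec-level update
      lives in its own helper, so it can later take a byte-sized delta
      without touching the node-level counter.

        #include <stdbool.h>
        #include <stdio.h>

        struct node_stat   { long nr_slab; };    /* node-level counter         */
        struct lruvec_stat { long nr_slab; };    /* memcg/lruvec-level counter */

        static bool cgroups_disabled;            /* stand-in for mem_cgroup_disabled() */

        /* memcg/lruvec part only: can switch to byte granularity on its own */
        static void mod_memcg_lruvec_state(struct lruvec_stat *lruvec, long delta)
        {
                lruvec->nr_slab += delta;
        }

        /* node-level update plus, when cgroups are enabled, the factored-out helper */
        static void mod_lruvec_state(struct node_stat *node,
                                     struct lruvec_stat *lruvec, long delta)
        {
                node->nr_slab += delta;
                if (!cgroups_disabled)
                        mod_memcg_lruvec_state(lruvec, delta);
        }

        int main(void)
        {
                struct node_stat node = { 0 };
                struct lruvec_stat lruvec = { 0 };

                mod_lruvec_state(&node, &lruvec, 4);
                printf("node: %ld, lruvec: %ld\n", node.nr_slab, lruvec.nr_slab);
                return 0;
        }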
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eedc4e5a
    • Roman Gushchin's avatar
      mm: kmem: make memcg_kmem_enabled() irreversible · d648bcc7
      Roman Gushchin authored
      
      
      Historically, kernel memory accounting was an opt-in feature that could
      be enabled for individual cgroups.  That is no longer true: it's on by
      default on both cgroup v1 and cgroup v2, and as long as a user has at
      least one non-root memory cgroup, kernel memory accounting is on.  So
      in most setups it's either always on (if memory cgroups are in use and
      kmem accounting is not disabled) or always off (otherwise).

      memcg_kmem_enabled() is used in many places to guard the kernel memory
      accounting code.  If memcg_kmem_enabled() can flip from returning true
      back to returning false (as it can now), we can't rely on it on release
      paths and have to check whether accounting was enabled earlier.

      Making memcg_kmem_enabled() irreversible (always returning true after
      it has returned true once) makes the general logic simpler and more
      robust.  It also allows guarding some checks which would otherwise stay
      unguarded.
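
      The kernel uses a static key for this; the userspace sketch below only
      models the one-way behaviour that the change relies on (the names are
      invented for the example):

        #include <stdbool.h>
        #include <stdio.h>

        static bool kmem_accounting_key;         /* models the enabled state */

        static bool kmem_accounting_enabled(void)
        {
                return kmem_accounting_key;
        }

        static void enable_kmem_accounting(void)
        {
                /* one-way switch: there is intentionally no way to clear it */
                kmem_accounting_key = true;
        }

        int main(void)
        {
                printf("before: %d\n", kmem_accounting_enabled());
                enable_kmem_accounting();        /* e.g. first non-root memcg created */
                printf("after:  %d\n", kmem_accounting_enabled());
                return 0;
        }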
      
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200702180926.1330769-1-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d648bcc7
    • Chris Down's avatar
      tmpfs: support 64-bit inums per-sb · ea3271f7
      Chris Down authored
      
      
      The default is still set to inode32 for backwards compatibility, but
      system administrators can opt in to the new 64-bit inode numbers by
      either:
      
      1. Passing inode64 on the command line when mounting, or
      2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
      
      The inode64 and inode32 names follow existing precedent from XFS.
      
      [hughd@google.com: Kconfig fixes]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
      
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea3271f7
    • Chris Down's avatar
      tmpfs: per-superblock i_ino support · e809d5f0
      Chris Down authored
      
      
      Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
      
      In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
      affected tiers, in excess of 10% of hosts show multiple files with
      different content and the same inode number, with some servers even having
      as many as 150 duplicated inode numbers with differing file content.
      
      This causes actual, tangible problems in production.  For example, we
      have complaints from those working on remote caches that their
      application reports cache corruption because it uses (device, inodenum)
      to establish the identity of a particular cache object; since that pair
      is no longer unique, the application refuses to continue.  Even worse,
      sometimes applications may not detect the corruption at all and
      continue anyway, causing phantom and hard-to-debug behaviour.
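
      For reference, this is the kind of identity check such applications
      perform (a standalone example, not code from the affected caches).  Two
      distinct live files printing the same pair here is exactly the
      confusion described above.

        #include <stdio.h>
        #include <sys/stat.h>

        int main(int argc, char **argv)
        {
                struct stat st;

                if (argc < 2) {
                        fprintf(stderr, "usage: %s <path>\n", argv[0]);
                        return 1;
                }
                if (stat(argv[1], &st) != 0) {
                        perror("stat");
                        return 1;
                }
                /* (device, inode number) pair used as the object identity */
                printf("dev=%llu ino=%llu\n",
                       (unsigned long long)st.st_dev,
                       (unsigned long long)st.st_ino);
                return 0;
        }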
      
      In general, userspace applications expect that (device, inodenum)
      should be enough to uniquely point to one inode, which seems fair
      enough.  One might also need to check the generation, but in this case:
      
      1. That's not currently exposed to userspace
         (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
      2. Even with generation, there shouldn't be two live inodes with the
         same inode number on one device.
      
      In order to mitigate this, we take a two-pronged approach:
      
      1. Moving inum generation from being global to per-sb for tmpfs. This
         itself allows some reduction in i_ino churn. This works on both 64-
         and 32-bit machines.
      2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
         64-bit ino_t only: we allow users to mount tmpfs with a new inode64
         option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
      
      You can see how this compares to previous related patches which didn't
      implement this per-superblock:
      
      - https://patchwork.kernel.org/patch/11254001/
      - https://patchwork.kernel.org/patch/11023915/
      
      This patch (of 2):
      
      get_next_ino has a number of problems:

      - It uses and returns a uint, which is susceptible to overflow if a
        lot of volatile inodes that use get_next_ino are created.
      - It's global, with no per-sb or even per-filesystem specificity. This
        means it's not that difficult to cause inode number wraparounds on a
        single device, which can result in multiple distinct inodes having
        the same inode number.
      
      This patch adds a per-superblock counter that mitigates the second
      case.  This design also allows us to later have a specific i_ino size
      per device, for example, letting users choose whether to use 32- or
      64-bit inodes for each tmpfs mount.  This is implemented in the next
      commit.

      For internal shmem mounts, which may be less tolerant of spinlock
      delays, we implement a percpu batching scheme that only takes the
      stat_lock at each batch boundary.
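
      The batching idea, sketched as a small userspace model (not the shmem
      code; the names and the batch size are made up for the illustration): a
      whole range of inode numbers is reserved from the shared counter under
      the lock, and individual numbers are then handed out locklessly until
      the reserved range is exhausted.

        /* build with: cc -pthread */
        #include <pthread.h>
        #include <stdio.h>

        #define INO_BATCH 1024

        static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
        static unsigned long next_ino = 1;       /* shared, protected by stat_lock */

        struct ino_batch {
                unsigned long cur;
                unsigned long end;
        };

        static unsigned long get_ino(struct ino_batch *b)
        {
                if (b->cur == b->end) {
                        /* the lock is taken only at batch boundaries */
                        pthread_mutex_lock(&stat_lock);
                        b->cur = next_ino;
                        next_ino += INO_BATCH;
                        pthread_mutex_unlock(&stat_lock);
                        b->end = b->cur + INO_BATCH;
                }
                return b->cur++;                 /* lockless within the batch */
        }

        int main(void)
        {
                struct ino_batch batch = { 0, 0 };

                for (int i = 0; i < 3; i++)
                        printf("ino %lu\n", get_ino(&batch));
                return 0;
        }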
      
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e809d5f0
    • Xianting Tian's avatar
      mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io · 0f190a7a
      Xianting Tian authored
      
      
      swap_readpage() does synchronous I/O for one page.  The I/O is not big
      and normally finishes quickly, but it may take a long time or wait
      forever in case of an I/O failure or discard.

      This patch uses blk_io_schedule() instead of io_schedule() to avoid
      task hung reports and crashes (when /proc/sys/kernel/hung_task_panic is
      set) when the above exception occurs.

      This is similar to the hung-task avoidance in submit_bio_wait(),
      blk_execute_rq() and __blkdev_direct_IO().
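
      The pattern can be illustrated with a userspace analogue (this is not
      the block-layer code; the watchdog period is a made-up example value):
      instead of a single unbounded wait, sleep in bounded slices shorter
      than the watchdog period, so a stuck I/O never shows up as one
      over-long uninterrupted sleep.

        /* build with: cc -pthread */
        #include <pthread.h>
        #include <stdbool.h>
        #include <time.h>

        #define WATCHDOG_SECS 120                /* e.g. a hung-task style timeout */

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;
        static bool io_done;

        /* wait for completion, never blocking longer than half the watchdog period */
        static void wait_for_io(void)
        {
                pthread_mutex_lock(&lock);
                while (!io_done) {
                        struct timespec ts;

                        clock_gettime(CLOCK_REALTIME, &ts);
                        ts.tv_sec += WATCHDOG_SECS / 2;
                        pthread_cond_timedwait(&done_cv, &lock, &ts);
                }
                pthread_mutex_unlock(&lock);
        }

        static void *complete_io(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&lock);
                io_done = true;                  /* the "I/O" finished */
                pthread_cond_signal(&done_cv);
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_create(&t, NULL, complete_io, NULL);
                wait_for_io();
                pthread_join(t, NULL);
                return 0;
        }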
      
      Signed-off-by: default avatarXianting Tian <xianting_tian@126.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1596461807-21087-1-git-send-email-xianting_tian@126.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f190a7a
    • Krzysztof Kozlowski's avatar
      mm: swap: fix kerneldoc of swap_vma_readahead() · 27ec4878
      Krzysztof Kozlowski authored
      
      
      Fix W=1 compile warnings (invalid kerneldoc):
      
          mm/swap_state.c:742: warning: Function parameter or member 'fentry' not described in 'swap_vma_readahead'
          mm/swap_state.c:742: warning: Excess function parameter 'entry' description in 'swap_vma_readahead'
      
      Signed-off-by: default avatarKrzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200728171109.28687-2-krzk@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27ec4878
    • Zhen Lei's avatar
      mm/swap_slots.c: remove redundant check for swap_slot_cache_initialized · e0f3ebba
      Zhen Lei authored
      
      
      swap_slot_cache_enabled can only become true in
      enable_swap_slots_cache(), which requires swap_slot_cache_initialized
      to be true first.  That means that whenever swap_slot_cache_enabled is
      true, swap_slot_cache_initialized is true as well.

      So the condition
      "swap_slot_cache_enabled && swap_slot_cache_initialized"
      can be reduced to "swap_slot_cache_enabled".

      And, by De Morgan's law,
      "!swap_slot_cache_enabled || !swap_slot_cache_initialized"
      is equal to "!(swap_slot_cache_enabled && swap_slot_cache_initialized)"

      So there is no functional change.
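
      The reasoning can be checked mechanically; this tiny standalone program
      (illustrative only) enumerates the states allowed by the invariant
      "enabled implies initialized" and verifies both reductions:

        #include <assert.h>
        #include <stdbool.h>

        int main(void)
        {
                for (int e = 0; e <= 1; e++) {
                        for (int i = 0; i <= 1; i++) {
                                bool enabled = e, initialized = i;

                                if (enabled && !initialized)
                                        continue;        /* violates the invariant */

                                /* combined condition reduces to "enabled" alone */
                                assert((enabled && initialized) == enabled);
                                /* De Morgan: negated OR equals negated AND */
                                assert((!enabled || !initialized) ==
                                       !(enabled && initialized));
                        }
                }
                return 0;
        }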
      
      Signed-off-by: default avatarZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-4-thunder.leizhen@huawei.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0f3ebba
    • Zhen Lei's avatar
      mm/swap_slots.c: simplify enable_swap_slots_cache() · d69a9575
      Zhen Lei authored
      
      
      Whether swap_slot_cache_initialized is true or false,
      __reenable_swap_slots_cache() is always called.  To make this clear,
      leave only one call to __reenable_swap_slots_cache().  This also makes
      it clearer what extra work needs to be done when
      swap_slot_cache_initialized is false.
      
      No functional change.
      
      Signed-off-by: default avatarZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-3-thunder.leizhen@huawei.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d69a9575
    • Zhen Lei's avatar
      mm/swap_slots.c: simplify alloc_swap_slot_cache() · f90eae2a
      Zhen Lei authored
      
      
      Patch series "clean up some functions in mm/swap_slots.c".
      
      When I studied the code of mm/swap_slots.c, I found some places that
      can be improved.

      This patch (of 3):

      Both "slots" and "slots_ret" only need to be freed when the cache has
      already been allocated.  Moving the two free operations closer together
      makes this clearer.

      No functional change.
      
      Signed-off-by: default avatarZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-1-thunder.leizhen@huawei.com
      Link: http://lkml.kernel.org/r/20200430061143.450-2-thunder.leizhen@huawei.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f90eae2a
    • Tang Yizhou's avatar
      mm/gup.c: fix the comment of return value for populate_vma_page_range() · 0a36f7f8
      Tang Yizhou authored
      
      
      The return value of populate_vma_page_range() is consistent with that
      of __get_user_pages(), so make the function's return-value comment
      consistent with it as well.
      
      Signed-off-by: default avatarTang Yizhou <tangyizhou@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Link: http://lkml.kernel.org/r/20200720034303.29920-1-tangyizhou@huawei.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a36f7f8
    • Yang Shi's avatar
      mm: filemap: add missing FGP_ flags in kerneldoc comment for pagecache_get_page · 605cad83
      Yang Shi authored
      
      
      FGP_{WRITE|NOFS|NOWAIT} were missing from pagecache_get_page()'s
      kerneldoc comment.
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Gang Deng <gavin.dg@linux.alibaba.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      605cad83