  1. May 01, 2021
    • Muchun Song's avatar
      mm: memcontrol: slab: fix obtain a reference to a freeing memcg · 9f38f03a
      Muchun Song authored
      Patch series "Use obj_cgroup APIs to charge kmem pages", v5.
      
       Since Roman's series "The new cgroup slab memory controller" was
       applied, all slab objects are charged with the new obj_cgroup APIs.
       The new APIs introduce a struct obj_cgroup to charge slab objects,
       which prevents long-living objects from pinning the original memory
       cgroup in memory.  But there are still some corner-case objects
       (e.g.  allocations larger than an order-1 page on SLUB) which are not
       charged with the new APIs.  Those objects (including pages allocated
       directly from the buddy allocator) are charged as kmem pages, which
       still hold a reference to the memory cgroup.
      
       For example, we know that the kernel stack is charged as kmem pages
       because its size can be greater than 2 pages (e.g.  16KB on x86_64 or
       arm64).  Suppose we create a thread (whose stack is charged to memory
       cgroup A) and then move it from memory cgroup A to memory cgroup B.
       Because the kernel stack of the thread holds a reference to memory
       cgroup A, the thread can pin memory cgroup A in memory even after
       cgroup A is removed.  The following script demonstrates this scenario:
       after running it, the system has accumulated 500 dying cgroups.
       (This is not a real-world issue, just a script showing that large
       kmallocs are charged as kmem pages which can pin the memory cgroup in
       memory.)
      
      	#!/bin/bash
      
      	cat /proc/cgroups | grep memory
      
      	cd /sys/fs/cgroup/memory
      	echo 1 > memory.move_charge_at_immigrate
      
       	for i in {1..500}
      	do
      		mkdir kmem_test
      		echo $$ > kmem_test/cgroup.procs
      		sleep 3600 &
      		echo $$ > cgroup.procs
      		echo `cat kmem_test/cgroup.procs` > cgroup.procs
      		rmdir kmem_test
      	done
      
      	cat /proc/cgroups | grep memory
      
       This patchset aims to make those kmem pages drop their reference to
       the memory cgroup by using the obj_cgroup APIs.  Finally, we can see
       that the number of dying cgroups does not increase if we run the
       above test script.
      
      This patch (of 7):
      
       The rcu_read_lock/unlock can only guarantee that the memcg will not
       be freed; it cannot guarantee that css_get() on the memcg (done in
       refill_stock() when the cached memcg changes) succeeds.
      
        rcu_read_lock()
        memcg = obj_cgroup_memcg(old)
        __memcg_kmem_uncharge(memcg)
            refill_stock(memcg)
                if (stock->cached != memcg)
                    // css_get can change the ref counter from 0 back to 1.
                    css_get(&memcg->css)
        rcu_read_unlock()
      
       This fix is very similar to the commit:
      
        eefbfa7f ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
      
       Fix this by taking a reference to the memcg that is passed to
       __memcg_kmem_uncharge() before calling it.
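
       A minimal sketch of that pattern (the helper name below is
       illustrative, not necessarily what the patch adds): pin the memcg
       with css_tryget() while still under RCU, and only then uncharge:

       	static struct mem_cgroup *get_memcg_from_objcg(struct obj_cgroup *objcg)
       	{
       		struct mem_cgroup *memcg;

       		rcu_read_lock();
       	retry:
       		memcg = obj_cgroup_memcg(objcg);
       		if (unlikely(!css_tryget(&memcg->css)))
       			goto retry;
       		rcu_read_unlock();

       		return memcg;
       	}

       	/* caller side, e.g. in drain_obj_stock(): */
       	memcg = get_memcg_from_objcg(old);
       	__memcg_kmem_uncharge(memcg, nr_pages);
       	css_put(&memcg->css);	/* drop the temporary reference */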
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210319163821.20704-2-songmuchun@bytedance.com
       Fixes: 3de7d4f2 ("mm: memcg/slab: optimize objcg stock draining")
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f38f03a
    • Shakeel Butt's avatar
      memcg: charge before adding to swapcache on swapin · 0add0c77
      Shakeel Butt authored
      
      
       Currently the kernel adds the page, allocated for swapin, to the
       swapcache before charging the page.  This is fine, but now we want a
       per-memcg swapcache stat, which is essential for folks who want to
       transparently migrate from cgroup v1's memsw to cgroup v2's memory
       and swap counters.  In addition, charging a page before exposing it
       to other parts of the kernel is a step in the right direction.
      
       To correctly maintain the per-memcg swapcache stat, this patch
       charges the page before adding it to the swapcache.  One challenge
       with this approach is the failure case of add_to_swap_cache(), where
       we need to undo the mem_cgroup_charge().  Specifically, undoing
       mem_cgroup_uncharge_swap() is not simple.
      
      To resolve the issue, this patch decouples the charging for swapin pages
      from mem_cgroup_charge().  Two new functions are introduced,
      mem_cgroup_swapin_charge_page() for just charging the swapin page and
      mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the
      page has been successfully added to the swapcache.
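
       The resulting ordering in a swapin allocation path looks roughly like
       the sketch below (locking, shadow-entry and readahead details are
       left out; the wrapper function itself is hypothetical):

       	static struct page *swapin_alloc_and_charge(swp_entry_t entry, gfp_t gfp,
       						    struct vm_area_struct *vma,
       						    unsigned long addr)
       	{
       		struct page *page = alloc_page_vma(gfp, vma, addr);
       		void *shadow = NULL;

       		if (!page)
       			return NULL;

       		/* charge the page before exposing it via the swapcache */
       		if (mem_cgroup_swapin_charge_page(page, vma->vm_mm, gfp, entry))
       			goto out_put;

       		if (add_to_swap_cache(page, entry, gfp, &shadow))
       			goto out_put;	/* freeing the page drops its charge */

       		/* the swap slot is uncharged separately, once the add succeeded */
       		mem_cgroup_swapin_uncharge_swap(entry);
       		return page;

       	out_put:
       		put_page(page);
       		return NULL;
       	}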
      
      [shakeelb@google.com: set page->private before calling swap_readpage]
        Link: https://lkml.kernel.org/r/20210318015959.2986837-1-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarHeiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0add0c77
    • Johannes Weiner's avatar
      kselftests: cgroup: update kmem test for new vmstat implementation · 4bbcc5a4
      Johannes Weiner authored
      
      
      With memcg having switched to rstat, memory.stat output is precise.
      Update the cgroup selftest to reflect the expectations and error
      tolerances of the new implementation.
      
      Also add newly tracked types of memory to the memory.stat side of the
      equation, since they're included in memory.current and could throw false
      positives.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-9-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4bbcc5a4
    • Johannes Weiner's avatar
      mm: memcontrol: consolidate lruvec stat flushing · 2cd21c89
      Johannes Weiner authored
      
      
      There are two functions to flush the per-cpu data of an lruvec into the
      rest of the cgroup tree: when the cgroup is being freed, and when a CPU
      disappears during hotplug.  The difference is whether all CPUs or just
      one is being collected, but the rest of the flushing code is the same.
      Merge them into one function and share the common code.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2cd21c89
    • Johannes Weiner's avatar
      mm: memcontrol: switch to rstat · 2d146aa3
      Johannes Weiner authored
      
      
      Replace the memory controller's custom hierarchical stats code with the
      generic rstat infrastructure provided by the cgroup core.
      
      The current implementation does batched upward propagation from the
      write side (i.e.  as stats change).  The per-cpu batches introduce an
      error, which is multiplied by the number of subgroups in a tree.  In
      systems with many CPUs and sizable cgroup trees, the error can be large
      enough to confuse users (e.g.  32 batch pages * 32 CPUs * 32 subgroups
      results in an error of up to 128M per stat item).  This can entirely
      swallow allocation bursts inside a workload that the user is expecting
      to see reflected in the statistics.
      
      In the past, we've done read-side aggregation, where a memory.stat read
      would have to walk the entire subtree and add up per-cpu counts.  This
      became problematic with lazily-freed cgroups: we could have large
      subtrees where most cgroups were entirely idle.  Hence the switch to
      change-driven upward propagation.  Unfortunately, it needed to trade
      accuracy for speed due to the write side being so hot.
      
      Rstat combines the best of both worlds: from the write side, it cheaply
      maintains a queue of cgroups that have pending changes, so that the read
      side can do selective tree aggregation.  This way the reported stats
      will always be precise and recent as can be, while the aggregation can
      skip over potentially large numbers of idle cgroups.
      
       The way rstat works is that it implements a tree for tracking cgroups
       with pending local changes, as well as a flush function that walks
       the tree upwards.  The controller then drives this by 1) telling
       rstat when a local cgroup stat changes (e.g.  mod_memcg_state) and
       2) asking rstat to flush when up-to-date hierarchy stats are needed
       for a given subtree (e.g.  when memory.stat is read).  The controller
       also provides a flush callback that is called during the rstat flush
       walk for each cgroup; it aggregates the cgroup's local per-cpu
       counters and propagates them upwards.
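
       A rough sketch of that division of labor (the memcg field names are
       assumptions based on the description above; cgroup_rstat_updated()
       and cgroup_rstat_flush() are the core rstat hooks):

       	/* 1) write side: update the per-cpu counter, mark the cgroup dirty */
       	static void memcg_mod_state(struct mem_cgroup *memcg, int idx, int val)
       	{
       		__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
       		cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
       	}

       	/* 2) read side: flush the cgroups with pending changes, then read */
       	static unsigned long memcg_read_state(struct mem_cgroup *memcg, int idx)
       	{
       		cgroup_rstat_flush(memcg->css.cgroup);
       		return READ_ONCE(memcg->vmstats.state[idx]);
       	}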
      
      This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
      NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
      aggregation.  It removes 3 words from the per-cpu data.  It eliminates
      memcg_exact_page_state(), since memcg_page_state() is now exact.
      
      [akpm@linux-foundation.org: merge fix]
      [hannes@cmpxchg.org: fix a sleep in atomic section problem]
        Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d146aa3
    • Johannes Weiner's avatar
      cgroup: rstat: punt root-level optimization to individual controllers · dc26532a
      Johannes Weiner authored
      
      
      Current users of the rstat code can source root-level statistics from
      the native counters of their respective subsystem, allowing them to
      forego aggregation at the root level.  This optimization is currently
      implemented inside the generic rstat code, which doesn't track the root
      cgroup and doesn't invoke the subsystem flush callbacks on it.
      
      However, the memory controller cannot do this optimization, because
      cgroup1 breaks out memory specifically for the local level, including at
      the root level.  In preparation for the memory controller switching to
      rstat, move the optimization from rstat core to the controllers.
      
      Afterwards, rstat will always track the root cgroup for changes and
      invoke the subsystem callbacks on it; and it's up to the subsystem to
      special-case and skip aggregation of the root cgroup if it can source
      this information through other, cheaper means.
      
      This is the case for the io controller and the cgroup base stats.  In
      their respective flush callbacks, check whether the parent is the root
      cgroup, and if so, skip the unnecessary upward propagation.
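
       A hypothetical flush callback illustrating that check (the example_*
       helpers are placeholders; cgroup_parent() returns NULL for the root
       cgroup):

       	static void example_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
       	{
       		struct cgroup *parent = cgroup_parent(css->cgroup);

       		/* fold this CPU's pending deltas into the cgroup's own counters */
       		example_fold_percpu_deltas(css, cpu);

       		/* root totals come from global counters: don't propagate into the root */
       		if (!parent || !cgroup_parent(parent))
       			return;

       		example_propagate(css, parent);
       	}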
      
      The extra cost of tracking the root cgroup is negligible: on stat
      changes, we actually remove a branch that checks for the root.  The
      queueing for a flush touches only per-cpu data, and only the first stat
      change since a flush requires a (per-cpu) lock.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc26532a
    • Johannes Weiner's avatar
      cgroup: rstat: support cgroup1 · a7df69b8
      Johannes Weiner authored
      
      
      Rstat currently only supports the default hierarchy in cgroup2.  In
      order to replace memcg's private stats infrastructure - used in both
      cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
      
      The initialization and destruction callbacks for regular cgroups are
      already in place.  Remove the cgroup_on_dfl() guards to handle cgroup1.
      
      The initialization of the root cgroup is currently hardcoded to only
      handle cgrp_dfl_root.cgrp.  Move those callbacks to cgroup_setup_root()
      and cgroup_destroy_root() to handle the default root as well as the
      various cgroup1 roots we may set up during mounting.
      
      The linking of css to cgroups happens in code shared between cgroup1 and
      cgroup2 as well.  Simply remove the cgroup_on_dfl() guard.
      
      Linkage of the root css to the root cgroup is a bit trickier: per
      default, the root css of a subsystem controller belongs to the default
      hierarchy (i.e.  the cgroup2 root).  When a controller is mounted in its
      cgroup1 version, the root css is stolen and moved to the cgroup1 root;
      on unmount, the css moves back to the default hierarchy.  Annotate
      rebind_subsystems() to move the root css linkage along between roots.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7df69b8
    • Johannes Weiner's avatar
      mm: memcontrol: privatize memcg_page_state query functions · a18e6e6e
      Johannes Weiner authored
      
      
      There are no users outside of the memory controller itself. The rest
      of the kernel cares either about node or lruvec stats.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a18e6e6e
    • Johannes Weiner's avatar
      mm: memcontrol: kill mem_cgroup_nodeinfo() · a3747b53
      Johannes Weiner authored
      
      
      No need to encapsulate a simple struct member access.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3747b53
    • Johannes Weiner's avatar
      mm: memcontrol: fix cpuhotplug statistics flushing · a3d4c05a
      Johannes Weiner authored
      Patch series "mm: memcontrol: switch to rstat", v3.
      
      This series converts memcg stats tracking to the streamlined rstat
      infrastructure provided by the cgroup core code.  rstat is already used by
      the CPU controller and the IO controller.  This change is motivated by
      recent accuracy problems in memcg's custom stats code, as well as the
      benefits of sharing common infra with other controllers.
      
      The current memcg implementation does batched tree aggregation on the
      write side: local stat changes are cached in per-cpu counters, which are
      then propagated upward in batches when a threshold (32 pages) is exceeded.
      This is cheap, but the error introduced by the lazy upward propagation
      adds up: 32 pages times CPUs times cgroups in the subtree.  We've had
      complaints from service owners that the stats do not reliably track and
      react to allocation behavior as expected, sometimes swallowing the results
      of entire test applications.
      
      The original memcg stat implementation used to do tree aggregation
      exclusively on the read side: local stats would only ever be tracked in
      per-cpu counters, and a memory.stat read would iterate the entire subtree
      and sum those counters up.  This didn't keep up with the times:
      
       - Cgroup trees are much bigger now. We switched to lazily-freed
         cgroups, where deleted groups would hang around until their remaining
         page cache has been reclaimed. This can result in large subtrees that
         are expensive to walk, while most of the groups are idle and their
         statistics don't change much anymore.
      
       - Automated monitoring increased. With the proliferation of userspace
         oom killing, proactive reclaim, and higher-resolution logging of
         workload trends in general, top-level stat files are polled at least
         once a second in many deployments.
      
       - The lifetime of cgroups got shorter. Where most cgroup setups in the
         past would have a few large policy-oriented cgroups for everything
         running on the system, newer cgroup deployments tend to create one
         group per application - which gets deleted again as the processes
         exit. An aggregation scheme that doesn't retain child data inside the
         parents loses event history of the subtree.
      
      Rstat addresses all three of those concerns through intelligent,
      persistent read-side aggregation.  As statistics change at the local
      level, rstat tracks - on a per-cpu basis - only those parts of a subtree
      that have changes pending and require aggregation.  The actual
      aggregation occurs on the colder read side - which can now skip over
      (potentially large) numbers of recently idle cgroups.
      
      ===
      
      The test_kmem cgroup selftest is currently failing due to excessive
      cumulative vmstat drift from 100 subgroups:
      
          ok 1 test_kmem_basic
          memory.current = 8810496
          slab + anon + file + kernel_stack = 17074568
          slab = 6101384
          anon = 946176
          file = 0
          kernel_stack = 10027008
          not ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      As you can see, memory.stat items far exceed memory.current.  The kernel
      stack alone is bigger than all of charged memory.  That's because the
      memory of the test has been uncharged from memory.current, but the
      negative vmstat deltas are still sitting in the percpu caches.
      
      The test at this time isn't even counting percpu, pagetables etc.  yet,
      which would further contribute to the error.  The last patch in the series
      updates the test to include them - as well as reduces the vmstat
      tolerances in general to only expect page_counter batching.
      
      With all patches applied, the (now more stringent) test succeeds:
      
          ok 1 test_kmem_basic
          ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      ===
      
      A kernel build test confirms that overhead is comparable.  Two kernels are
      built simultaneously in a nested tree with several idle siblings:
      
      root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                                   `- build-b (defconfig, make -j16)
                                                   `- idle-1
                                                   `- ...
                                                   `- idle-9
      
      During the builds, kernelbuild/memory.stat is read once a second.
      
       A perf diff shows that the change in cycle distribution is
       minimal.  Top 10 kernel symbols:
      
           0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
           0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
           0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
           0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
           0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
           0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
           0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
           0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
           0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault
      
      ===
      
      The on-demand aggregated stats are now fully accurate:
      
      $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
        grep -e inactive_file /sys/fs/cgroup/memory.stat
      
      vanilla:                              patched:
      nr_inactive_file 1574105088           nr_inactive_file 1027801088
         inactive_file 1577410560              inactive_file 1027801088
      
      ===
      
      This patch (of 8):
      
      The memcg hotunplug callback erroneously flushes counts on the local CPU,
      not the counts of the CPU going away; those counts will be lost.
      
      Flush the CPU that is actually going away.
      
      Also simplify the code a bit by using mod_memcg_state() and
      count_memcg_events() instead of open-coding the upward flush - this is
      comparable to how vmstat.c handles hotunplug flushing.
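
       An abstract sketch of the corrected flush (the iterator and field
       names are approximations of the memcg internals, not the exact hunk):
       read the departing CPU's counters with per_cpu_ptr() rather than
       this_cpu ops, and fold them upward with the usual accessors:

       	static int memcg_hotplug_cpu_dead(unsigned int cpu)
       	{
       		struct mem_cgroup *memcg;
       		int i;

       		for_each_mem_cgroup(memcg) {
       			struct memcg_vmstats_percpu *statc =
       				per_cpu_ptr(memcg->vmstats_percpu, cpu);

       			for (i = 0; i < MEMCG_NR_STAT; i++) {
       				/* the CPU is gone, nobody else updates statc */
       				long x = statc->stat[i];

       				if (x)
       					mod_memcg_state(memcg, i, x);
       			}
       			/* events are folded the same way via count_memcg_events() */
       		}
       		return 0;
       	}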
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
       Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3d4c05a
    • Shakeel Butt's avatar
      memcg: enable memcg oom-kill for __GFP_NOFAIL · 3d0cbb98
      Shakeel Butt authored
       In the era of the async memcg oom-killer, commit a0d8b00a ("mm: memcg:
       do not declare OOM from __GFP_NOFAIL allocations") added code to skip
       the memcg oom-killer for __GFP_NOFAIL allocations.  The reason was that
       __GFP_NOFAIL callers will not enter the async oom synchronization path
       and will keep the task marked as in memcg oom.  At that time, tasks
       marked in memcg oom could bypass the memcg limits, and the oom
       synchronization would only have happened later, in a later
       userspace-triggered page fault, thus letting a task marked as under
       memcg oom bypass the memcg limit for an arbitrary time.
      
       With the synchronous memcg oom-killer (commit 29ef680a ("memcg, oom:
       move out_of_memory back to the charge path")) and with tasks marked
       under memcg oom no longer allowed to bypass the memcg limits (commit
       1f14c1ac ("mm: memcg: do not allow task about to OOM kill to bypass
       the limit")), we can again allow __GFP_NOFAIL allocations to trigger
       memcg oom-kill.
      This will make memcg oom behavior closer to page allocator oom behavior.
      
      Link: https://lkml.kernel.org/r/20210223204337.2785120-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d0cbb98
    • Shakeel Butt's avatar
      memcg: cleanup root memcg checks · a4792030
      Shakeel Butt authored
      
      
       Replace the implicit checking of root memcg with explicit root memcg
       checking, i.e.  replace !css->parent with mem_cgroup_is_root().
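
       For illustration, the substitution at a typical call site (the
       surrounding lines are hypothetical):

       	/* before: implicit root check */
       	if (!memcg->css.parent)
       		return 0;

       	/* after: the intent is explicit */
       	if (mem_cgroup_is_root(memcg))
       		return 0;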
      
      Link: https://lkml.kernel.org/r/20210223205625.2792891-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4792030
    • Muchun Song's avatar
      mm: memcontrol: fix kernel stack account · 27faca83
      Muchun Song authored
       For simplification, commit 991e7673 ("mm: memcontrol: account kernel
       stack per node") changed the accounting of vmalloc-backed stack pages
       from per zone to per node.

       By doing that we have lost a certain precision, because those pages
       might live on different NUMA nodes.  In the end, NR_KERNEL_STACK_KB
       exported to userspace might be overestimated on some nodes and
       underestimated on others.  This is not a real-world problem, just one
       found by reading the code, so there is no actual data showing how
       much impact it has on users.

       This doesn't pose any real problem for the correctness of kernel
       behavior, as the counter is not used for any internal processing, but
       it can cause some confusion to userspace.
      
      Address the problem by accounting each vmalloc backing page to its own
      node.
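
       A sketch of the per-page accounting this implies (it mirrors the idea
       rather than the exact hunk; task_stack_vm_area() is only non-NULL for
       vmalloc-backed stacks):

       	static void account_kernel_stack_pages(struct task_struct *tsk, int account)
       	{
       		struct vm_struct *vm = task_stack_vm_area(tsk);
       		int i;

       		if (!vm)
       			return;

       		/* charge each backing page to the node it actually lives on */
       		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
       			mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
       					      account * (PAGE_SIZE / 1024));
       	}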
      
      Link: https://lkml.kernel.org/r/20210303151843.81156-1-songmuchun@bytedance.com
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27faca83
    • Zhiyuan Dai's avatar
      mm/memremap.c: fix improper SPDX comment style · 2840d498
      Zhiyuan Dai authored
      
      
      Replace /* */ comment with //, fix SPDX comment style.
      
      see: Documentation/process/license-rules.rst
      
      Link: https://lkml.kernel.org/r/1614223348-15516-1-git-send-email-daizhiyuan@phytium.com.cn
      Signed-off-by: default avatarZhiyuan Dai <daizhiyuan@phytium.com.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2840d498
    • Yang Shi's avatar
      mm: gup: remove FOLL_SPLIT · 4066c119
      Yang Shi authored
       Since commit 5a52c9df ("uprobe: use FOLL_SPLIT_PMD instead of
       FOLL_SPLIT") and commit ba925fa3 ("s390/gmap: improve THP splitting"),
       FOLL_SPLIT has not been used anymore.  Remove the dead code.
      
      Link: https://lkml.kernel.org/r/20210330203900.9222-1-shy828301@gmail.com
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4066c119
    • Joao Martins's avatar
      RDMA/umem: batch page unpin in __ib_umem_release() · 1d4b0166
      Joao Martins authored
      
      
       Use the newly added unpin_user_page_range_dirty_lock() to more
       quickly unpin a consecutive range of pages represented as compound
       pages.  This also calculates the number of pages to unpin (for the
       tail pages which match the head page) and thus batches the refcount
       update.
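
       The batched unpin boils down to one call per scatterlist entry,
       roughly as below (struct ib_umem field names as of this kernel; error
       handling omitted):

       	struct scatterlist *sg;
       	unsigned int i;

       	for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
       		unpin_user_page_range_dirty_lock(sg_page(sg),
       				DIV_ROUND_UP(sg->length, PAGE_SIZE),
       				umem->writable && dirty);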
      
      Running a test program which calls memory range reg/unreg on a region 1G
      in size and measures cost of both operations together (in a guest using
      rxe) with THP and hugetlbfs:
      
      Before:
        590 rounds in 5.003 sec: 8480.335 usec / round
        6898 rounds in 60.001 sec: 8698.367 usec / round
      
      After:
        2688 rounds in 5.002 sec: 1860.786 usec / round
        32517 rounds in 60.001 sec: 1845.225 usec / round
      
      Link: https://lkml.kernel.org/r/20210212130843.13865-5-joao.m.martins@oracle.com
      Signed-off-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Acked-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d4b0166
    • Joao Martins's avatar
      mm/gup: add a range variant of unpin_user_pages_dirty_lock() · 458a4f78
      Joao Martins authored
      
      
      Add an unpin_user_page_range_dirty_lock() API which takes a starting page
      and how many consecutive pages we want to unpin and optionally dirty.
      
      To that end, define another iterator for_each_compound_range() that
      operates in page ranges as opposed to page array.
      
       For users (like RDMA mr_dereg) where each sg represents a contiguous
       set of pages, we're able to more efficiently unpin pages without
       having to supply an array of pages, as unpin_user_pages() requires
       today.
      
      Link: https://lkml.kernel.org/r/20210212130843.13865-4-joao.m.martins@oracle.com
      Suggested-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      458a4f78
    • Joao Martins's avatar
      mm/gup: decrement head page once for group of subpages · 31b912de
      Joao Martins authored
      
      
       Rather than decrementing the head page refcount one by one, we walk
       the page array and check which pages belong to the same
       compound_head.  Later on we decrement the calculated amount of
       references in a single write to the head page.  To that end, switch
       to for_each_compound_head(), which does most of the work.
      
      set_page_dirty() needs no adjustment as it's a nop for non-dirty head
      pages and it doesn't operate on tail pages.
      
      This considerably improves unpinning of pages with THP and hugetlbfs:
      
       - THP
      
         gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
         PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
      
      - 16G with 1G huge page size
      
        gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
        PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
      
      Link: https://lkml.kernel.org/r/20210212130843.13865-3-joao.m.martins@oracle.com
      Signed-off-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31b912de
    • Joao Martins's avatar
      mm/gup: add compound page list iterator · 8745d7f6
      Joao Martins authored
      
      
      Patch series "mm/gup: page unpining improvements", v4.
      
      This series improves page unpinning, with an eye on improving MR
      deregistration for big swaths of memory (which is bound by the page
      unpining), particularly:
      
       1) Decrement the head page refcount by @ntails, thus greatly reducing
          the number of atomic operations per compound page.  This is done
          by comparing the heads of the individual tail pages, counting the
          number of consecutive tails whose heads match, and updating the
          head page refcount based on that.  This should give a visible
          improvement in all page (un)pinners which use compound pages
      
       2) Introduce a new API for unpinning page ranges (to avoid the trick
          in the previous item and rely on arithmetic instead), and use it
          in RDMA ib_umem_release (used for mr deregistration).
      
      Performance improvements: unpin_user_pages() for hugetlbfs and THP
      improves ~3x (through gup_test) and RDMA MR dereg improves ~4.5x with the
      new API.  See patches 2 and 4 for those.
      
      This patch (of 4):
      
       Add a helper that iterates over head pages in a list of pages.  It
       essentially counts the tails until the next page to process has a
       different head than the current one.  This is going to be used by the
       unpin_user_pages() family of functions to batch the head page
       refcount updates once for all passed consecutive tail pages.
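
       The counting idea behind the iterator can be sketched as follows
       (put_compound_head() stands in for whatever batched refcount update
       the caller performs):

       	unsigned long i, ntails;

       	for (i = 0; i < npages; i += ntails) {
       		struct page *head = compound_head(pages[i]);

       		/* count how many following entries share the same head page */
       		for (ntails = 1; i + ntails < npages; ntails++)
       			if (compound_head(pages[i + ntails]) != head)
       				break;

       		/* a single refcount update covers the whole run */
       		put_compound_head(head, ntails);
       	}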
      
      Link: https://lkml.kernel.org/r/20210212130843.13865-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20210212130843.13865-2-joao.m.martins@oracle.com
      Signed-off-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Suggested-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8745d7f6
    • Nikita Ermakov's avatar
      mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start · f6899bc0
      Nikita Ermakov authored
      
      
      If an unmapped region was found and the flag is MS_ASYNC (without
      MS_INVALIDATE) there is nothing to do and the result would be always
      -ENOMEM, so return immediately.
      
      Link: https://lkml.kernel.org/r/20201025092901.56399-1-sh1r4s3@mail.si-head.nl
      Signed-off-by: default avatarNikita Ermakov <sh1r4s3@mail.si-head.nl>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6899bc0
    • Rui Sun's avatar
      mm/filemap: update stale comment · 4b17f030
      Rui Sun authored
       Commit a6de4b48 ("mm: convert find_get_entry to return the head page")
       uses @index instead of @offset, but the comment is stale; update it.
      
      Link: https://lkml.kernel.org/r/1617948260-50724-1-git-send-email-zhangshaokun@hisilicon.com
      Signed-off-by: default avatarRui Sun <sunrui26@huawei.com>
      Signed-off-by: default avatarShaokun Zhang <zhangshaokun@hisilicon.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b17f030
    • Matthew Wilcox (Oracle)'s avatar
      mm: move page_mapping_file to pagemap.h · 842ca547
      Matthew Wilcox (Oracle) authored
      
      
      page_mapping_file() is only used by some architectures, and then it
      is usually only used in one place.  Make it a static inline function
      so other architectures don't have to carry this dead code.
      
      Link: https://lkml.kernel.org/r/20210317123011.350118-1-willy@infradead.org
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      842ca547
    • Johannes Weiner's avatar
      mm: page-writeback: simplify memcg handling in test_clear_page_writeback() · 1c824a68
      Johannes Weiner authored
      Page writeback doesn't hold a page reference, which allows truncate to
      free a page the second PageWriteback is cleared.  This used to require
      special attention in test_clear_page_writeback(), where we had to be
      careful not to rely on the unstable page->memcg binding and look up all
      the necessary information before clearing the writeback flag.
      
       Since commit 073861ed ("mm: fix VM_BUG_ON(PageTail) and
       BUG_ON(PageWriteback)"), test_clear_page_writeback() is called with
       an explicit reference on the page, and this dance is no longer
       needed.
      
      Use unlock_page_memcg() and dec_lruvec_page_state() directly.
      
       This removes the last user of the lock_page_memcg() return value, so
       change it to void.  Touch up the comments in there as well.  This
       also removes the last extern user of __unlock_page_memcg(), so make
       it static.  Further, it removes the last user of dec_lruvec_state();
       delete it, along with a few other unused helpers.
      
      Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c824a68
    • Matthew Wilcox (Oracle)'s avatar
      mm/filemap: drop check for truncated page after I/O · 79e3094c
      Matthew Wilcox (Oracle) authored
      If the I/O completed successfully, the page will remain Uptodate, even
      if it is subsequently truncated.  If the I/O completed with an error,
      this check would cause us to retry the I/O if the page were truncated
      before we woke up.  There is no need to retry the I/O; the I/O to fill
      the page failed, so we can legitimately just return -EIO.
      
      This code was originally added by commit 56f0d5fe6851 ("[PATCH]
      readpage-vs-invalidate fix") in 2005 (this commit ID is from the
      linux-fullhistory tree; it is also commit ba1f08f14b52 in tglx-history).
      
       At the time, truncate_complete_page() called ClearPageUptodate(), and
       so this was fixing a real bug.  In 2008, commit 84209e02 ("mm: dont
       clear PG_uptodate on truncate/invalidate") removed the call to
       ClearPageUptodate, and this check has been unnecessary ever since.
      
      It doesn't do any real harm, but there's no need to keep it.
      
      Link: https://lkml.kernel.org/r/20210303222547.1056428-1-willy@infradead.org
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79e3094c
    • Matthew Wilcox (Oracle)'s avatar
      mm/filemap: use filemap_read_page in filemap_fault · d31fa86a
      Matthew Wilcox (Oracle) authored
      
      
      After splitting generic_file_buffered_read() into smaller parts, it turns
      out we can reuse one of the parts in filemap_fault().  This fixes an
      oversight -- waiting for the I/O to complete is now interruptible by a
      fatal signal.  And it saves us a few bytes of text in an unlikely path.
      
        $ ./scripts/bloat-o-meter before.o after.o
        add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-207 (-207)
        Function                                     old     new   delta
        filemap_fault                               2187    1980    -207
        Total: Before=37491, After=37284, chg -0.55%
      
      Link: https://lkml.kernel.org/r/20210226140011.2883498-1-willy@infradead.org
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d31fa86a
    • Jens Axboe's avatar
      iomap: use filemap_range_needs_writeback() for O_DIRECT reads · 985b71db
      Jens Axboe authored
      
      
      For reads, use the better variant of checking for the need to call
      filemap_write_and_wait_range() when doing O_DIRECT.  This avoids falling
      back to the slow path for IOCB_NOWAIT, if there are no pages to wait for
      (or write out).
      
      Link: https://lkml.kernel.org/r/20210224164455.1096727-4-axboe@kernel.dk
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      985b71db
    • Jens Axboe's avatar
      mm: use filemap_range_needs_writeback() for O_DIRECT reads · 7a60d6d7
      Jens Axboe authored
      
      
      For the generic page cache read helper, use the better variant of checking
      for the need to call filemap_write_and_wait_range() when doing O_DIRECT
      reads.  This avoids falling back to the slow path for IOCB_NOWAIT, if
      there are no pages to wait for (or write out).
      
      Link: https://lkml.kernel.org/r/20210224164455.1096727-3-axboe@kernel.dk
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a60d6d7
    • Jens Axboe's avatar
      mm: provide filemap_range_needs_writeback() helper · 63135aa3
      Jens Axboe authored
      
      
      Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.
      
      An internal workload complained because it was using too much CPU, and
      when I took a look, we had a lot of io_uring workers going to town.
      
      For an async buffered read like workload, I am normally expecting _zero_
      offloads to a worker thread, but this one had tons of them.  I'd drop
      caches and things would look good again, but then a minute later we'd
      regress back to using workers.  Turns out that every minute something
      was reading parts of the device, which would add page cache for that
      inode.  I put patches like these in for our kernel, and the problem was
      solved.
      
       Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
       entries for the given range.  This causes unnecessary work on the
       caller's side, when the IO could have been issued totally fine
       without blocking on writeback when there is none.
      
      This patch (of 3):
      
      For O_DIRECT reads/writes, we check if we need to issue a call to
      filemap_write_and_wait_range() to issue and/or wait for writeback for any
       page in the given range.  The existing mechanism just checks for a
       page in the range, which is suboptimal for IOCB_NOWAIT, as we'll fall
       back to the slow path (and need a retry) if there's just a clean page
       cache page in the range.
      
      Provide filemap_range_needs_writeback() which tries a little harder to
      check if we actually need to issue and/or wait for writeback in the range.
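
       In an O_DIRECT read path the helper slots in roughly like this (a
       sketch; the variable names are illustrative):

       	if (iocb->ki_flags & IOCB_NOWAIT) {
       		/* clean page cache pages are fine; only pending writeback matters */
       		if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
       						  iocb->ki_pos + count - 1))
       			return -EAGAIN;
       	} else {
       		error = filemap_write_and_wait_range(mapping, iocb->ki_pos,
       						     iocb->ki_pos + count - 1);
       		if (error)
       			return error;
       	}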
      
      Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
      Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63135aa3
    • Anshuman Khandual's avatar
      mm/memtest: add ARCH_USE_MEMTEST · dce44566
      Anshuman Khandual authored
      
      
      early_memtest() does not get called from all architectures.  Hence
      enabling CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel
      command line option might not trigger the memory pattern tests as would be
      expected in normal circumstances.  This situation is misleading.
      
       The change here prevents the above mentioned problem by introducing a
       new config option, ARCH_USE_MEMTEST, which should be selected by
       platforms that call early_memtest() in order to make CONFIG_MEMTEST
       available.  Conversely, CONFIG_MEMTEST cannot be enabled on platforms
       where it would not be tested anyway.
      
      Link: https://lkml.kernel.org/r/1617269193-22294-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com> (arm64)
      Reviewed-by: default avatarMax Filippov <jcmvbkbc@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dce44566
    • Sergei Trofimovich's avatar
      mm: page_poison: print page info when corruption is caught · f58bd538
      Sergei Trofimovich authored
      
      
       When page_poison detects page corruption, it's useful to see who
       freed the page recently, to get a guess at where the write-after-free
       corruption happened.

       After this change the corruption report has extra page data.
       Example report from a real corruption (includes only the page_owner part):
      
          pagealloc: memory corruption
          e00000014cd61d10: 11 00 00 00 00 00 00 00 30 1d d2 ff ff 0f 00 60  ........0......`
          e00000014cd61d20: b0 1d d2 ff ff 0f 00 60 90 fe 1c 00 08 00 00 20  .......`.......
          ...
          CPU: 1 PID: 220402 Comm: cc1plus Not tainted 5.12.0-rc5-00107-g9720c6f59ecf #245
          Hardware name: hp server rx3600, BIOS 04.03 04/08/2008
          ...
          Call Trace:
           [<a000000100015210>] show_stack+0x90/0xc0
           [<a000000101163390>] dump_stack+0x150/0x1c0
           [<a0000001003f1e90>] __kernel_unpoison_pages+0x410/0x440
           [<a0000001003c2460>] get_page_from_freelist+0x1460/0x2ca0
           [<a0000001003c6be0>] __alloc_pages_nodemask+0x3c0/0x660
           [<a0000001003ed690>] alloc_pages_vma+0xb0/0x500
           [<a00000010037deb0>] __handle_mm_fault+0x1230/0x1fe0
           [<a00000010037ef70>] handle_mm_fault+0x310/0x4e0
           [<a00000010005dc70>] ia64_do_page_fault+0x1f0/0xb80
           [<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
          page_owner tracks the page as freed
          page allocated via order 0, migratetype Movable,
            gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 37, ts 8173444098740
           __reset_page_owner+0x40/0x200
           free_pcp_prepare+0x4d0/0x600
           free_unref_page+0x20/0x1c0
           __put_page+0x110/0x1a0
           migrate_pages+0x16d0/0x1dc0
           compact_zone+0xfc0/0x1aa0
           proactive_compact_node+0xd0/0x1e0
           kcompactd+0x550/0x600
           kthread+0x2c0/0x2e0
           call_payload+0x50/0x80
      
      Here we can see that page was freed by page migration but something
      managed to write to it afterwards.
      
      [slyfox@gentoo.org: s/dump_page_owner/dump_page/, per Vlastimil]
        Link: https://lkml.kernel.org/r/20210407230800.1086854-1-slyfox@gentoo.org
      
      Link: https://lkml.kernel.org/r/20210404141735.2152984-1-slyfox@gentoo.org
      Signed-off-by: default avatarSergei Trofimovich <slyfox@gentoo.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f58bd538
    • Sergei Trofimovich's avatar
      mm: page_owner: detect page_owner recursion via task_struct · 8e9b16c4
      Sergei Trofimovich authored
      
      
       Before the change, page_owner recursion was detected by fetching a
       backtrace and inspecting it for the current instruction pointer.
       This has a few problems:
      
       - it is slightly slow as it requires extra backtrace and a linear stack
         scan of the result
      
       - it is too late to check if backtrace fetching required memory
         allocation itself (ia64's unwinder requires it).
      
       To simplify recursion tracking, let's use a page_owner recursion flag
       in 'struct task_struct'.

       The change makes page_owner=on work on ia64 by avoiding infinite
       recursion in:
        kmalloc()
        -> __set_page_owner()
        -> save_stack()
        -> unwind() [ia64-specific]
        -> build_script()
        -> kmalloc()
        -> __set_page_owner() [we short-circuit here]
        -> save_stack()
        -> unwind() [recursion]
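
       A sketch of the resulting guard in save_stack() (the task_struct flag
       name may differ; dummy_handle/failure_handle are page_owner's
       preallocated stack handles):

       	static noinline depot_stack_handle_t save_stack(gfp_t flags)
       	{
       		unsigned long entries[PAGE_OWNER_STACK_DEPTH];
       		depot_stack_handle_t handle;
       		unsigned int nr_entries;

       		/* called back from the unwinder's own allocation: bail out */
       		if (current->in_page_owner)
       			return dummy_handle;
       		current->in_page_owner = 1;

       		nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
       		handle = stack_depot_save(entries, nr_entries, flags);
       		if (!handle)
       			handle = failure_handle;

       		current->in_page_owner = 0;
       		return handle;
       	}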
      
      Link: https://lkml.kernel.org/r/20210402115342.1463781-1-slyfox@gentoo.org
      Signed-off-by: default avatarSergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8e9b16c4
    • Sergei Trofimovich's avatar
      mm: page_owner: use kstrtobool() to parse bool option · 608b5d66
      Sergei Trofimovich authored
      
      
       I tried to use page_owner=1 for a while and noticed too late that it
       had no effect, as opposed to the similar init_on_alloc=1 (which does
       work).

       Let's make them consistent.

       The change decreases the binary size slightly:
         text    data     bss     dec     hex filename
        12408     321      17   12746    31ca mm/page_owner.o.before
        12320     321      17   12658    3172 mm/page_owner.o.after
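
       With kstrtobool() the boot parameter handling reduces to roughly the
       following (a sketch of mm/page_owner.c's early param hook):

       	static bool page_owner_enabled;

       	static int __init early_page_owner_param(char *buf)
       	{
       		return kstrtobool(buf, &page_owner_enabled);
       	}
       	early_param("page_owner", early_page_owner_param);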
      
      Link: https://lkml.kernel.org/r/20210401210909.3532086-1-slyfox@gentoo.org
      Signed-off-by: default avatarSergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      608b5d66
    • Sergei Trofimovich's avatar
      mm: page_owner: fetch backtrace only for tracked pages · fab765c2
      Sergei Trofimovich authored
      
      
      Very minor optimization.
      
      Link: https://lkml.kernel.org/r/20210401212445.3534721-1-slyfox@gentoo.org
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fab765c2
    • zhongjiang-ali's avatar
      mm, page_owner: remove unused parameter in __set_page_owner_handle · 64ea78d2
      zhongjiang-ali authored
      Since commit 5556cfe8 ("mm, page_owner: fix off-by-one error in
      __set_page_owner_handle()"), the 'page' parameter of
      __set_page_owner_handle() is no longer used, so remove it.
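      
      In other words, the helper's signature shrinks by one argument; the
      sketch below shows the shape of the change (prototypes abbreviated from
      mm/page_owner.c and meant as an illustration only):
      
        /* Before: 'page' is accepted but never referenced in the body. */
        static void __set_page_owner_handle(struct page *page,
                                            struct page_ext *page_ext,
                                            depot_stack_handle_t handle,
                                            unsigned int order, gfp_t gfp_mask);
      
        /* After: the dead parameter is gone; callers pass one argument less. */
        static void __set_page_owner_handle(struct page_ext *page_ext,
                                            depot_stack_handle_t handle,
                                            unsigned int order, gfp_t gfp_mask);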
      
      Link: https://lkml.kernel.org/r/1616602022-43545-1-git-send-email-zhongjiang-ali@linux.alibaba.com
      Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64ea78d2
    • Georgi Djakov's avatar
      mm/page_owner: record the timestamp of all pages during free · 866b4852
      Georgi Djakov authored
      
      
      Collect the time when each allocation is freed, to help with memory
      analysis with kdump/ramdump.  Add the timestamp also in the page_owner
      debugfs file and print it in dump_page().
      
      Having another timestamp when we free the page helps with debugging page
      migration issues.  For example, the alloc and free timestamps being the
      same can be a hint that there is an issue with migrating memory, as
      opposed to a page just being dropped during migration.
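      
      The gist of the change is a second timestamp next to the existing
      allocation one; a minimal sketch (field names mirror mm/page_owner.c,
      the struct layout is abbreviated) looks like this:
      
        struct page_owner {
                /* ... existing fields: order, gfp_mask, stack handles, pid ... */
                u64 ts_nsec;            /* when the page was allocated */
                u64 free_ts_nsec;       /* new: when the page was freed */
        };
      
        /* In __reset_page_owner(), record the free time for the page: */
        page_owner->free_ts_nsec = local_clock();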
      
      Link: https://lkml.kernel.org/r/20210203175905.12267-1-georgi.djakov@linaro.org
      Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      866b4852
    • Bhaskar Chowdhury's avatar
      mm/kmemleak.c: fix a typo · 0b5121ef
      Bhaskar Chowdhury authored
      
      
      s/interruptable/interruptible/
      
      Link: https://lkml.kernel.org/r/20210319214140.23304-1-unixbhaskar@gmail.com
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b5121ef
    • Bhaskar Chowdhury's avatar
      mm/slub.c: trivial typo fixes · dc84207d
      Bhaskar Chowdhury authored
      
      
      s/operatios/operations/
      s/Mininum/Minimum/
      s/mininum/minimum/  ......two different places.
      
      Link: https://lkml.kernel.org/r/20210325044940.14516-1-unixbhaskar@gmail.com
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc84207d
    • Vlastimil Babka's avatar
      mm, slub: enable slub_debug static key when creating cache with explicit debug flags · 1f0723a4
      Vlastimil Babka authored
      Commit ca0cab65 ("mm, slub: introduce static key for slub_debug()")
      introduced a static key to optimize the case where no debugging is
      enabled for any cache.  The static key is enabled when the slub_debug
      boot parameter is passed or CONFIG_SLUB_DEBUG_ON is enabled.
      
      However, some caches might be created with one or more debugging flags
      explicitly passed to kmem_cache_create(), and the commit missed this
      case.  Debugging was thus not actually performed for these caches
      unless the static key got enabled by boot parameter or config.
      
      This patch fixes it by checking for debugging flags passed to
      kmem_cache_create() and enabling the static key accordingly.
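      
      The shape of the fix is small; a hedged sketch is below (the helper name
      is made up for illustration, and upstream places the check inside
      kmem_cache_flags() in mm/slub.c; SLAB_DEBUG_FLAGS and the
      slub_debug_enabled static key are the existing definitions):
      
        #ifdef CONFIG_SLUB_DEBUG
        static void slub_debug_key_for_cache(slab_flags_t flags)
        {
                /*
                 * Explicit per-cache debug flags must also flip the global key,
                 * otherwise the fast path keeps skipping the debug checks.
                 */
                if (flags & SLAB_DEBUG_FLAGS)
                        static_branch_enable(&slub_debug_enabled);
        }
        #endif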
      
      Note that such explicit debugging flags should not be used outside of
      debugging and testing, as they will now enable the static key globally.
      btrfs_init_cachep() creates a cache with SLAB_RED_ZONE but that's a
      mistake that's being corrected [1].  rcu_torture_stats() creates a cache
      with SLAB_STORE_USER, but that is a testing module so it's OK and will
      start working as intended after this patch.
      
      Also note that in case of backports to kernels before v5.12 that don't
      have 59450bbc ("mm, slab, slub: stop taking cpu hotplug lock"),
      static_branch_enable_cpuslocked() should be used.
      
      [1] https://lore.kernel.org/linux-btrfs/20210315141824.26099-1-dsterba@suse.com/
      
      Link: https://lkml.kernel.org/r/20210315153415.24404-1-vbabka@suse.cz
      Fixes: ca0cab65 ("mm, slub: introduce static key for slub_debug()")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Oliver Glitta <glittao@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f0723a4
    • Rafael Aquini's avatar
      mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds · 82edd9d5
      Rafael Aquini authored
      
      
      This is a minor addition to the allocator setup options: it provides a
      simple way to re-enable cache merging on demand in builds that run with
      CONFIG_SLAB_MERGE_DEFAULT unset.
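      
      The boot option boils down to a tiny __setup() handler; a sketch
      (following the existing mm/slab_common.c style, where slab_nomerge
      defaults to true when CONFIG_SLAB_MERGE_DEFAULT is not set):
      
        static int __init setup_slab_merge(char *str)
        {
                /* "slab_merge" on the command line re-enables cache merging. */
                slab_nomerge = false;
                return 1;
        }
        __setup("slab_merge", setup_slab_merge);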
      
      Link: https://lkml.kernel.org/r/20210319194506.200159-1-aquini@redhat.com
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82edd9d5
    • Petr Mladek's avatar
      watchdog: cleanup handling of false positives · 9bf3bc94
      Petr Mladek authored
      Commit d6ad3e28 ("softlockup: Add sched_clock_tick() to avoid kernel
      warning on kgdb resume") introduced touch_softlockup_watchdog_sync().
      
      It solved a problem where the watchdog was touched in an atomic context,
      the timer callback proceeded right after interrupts were released, and
      the local clock had not been updated yet.  In this case, sched_clock_tick()
      was called in watchdog_timer_fn() before updating the timer.
      
      So far so good.
      
      Later commit 5d1c0f4a ("watchdog: add check for suspended vm in
      softlockup detector") added two kvm_check_and_clear_guest_paused()
      calls.  They touch the watchdog when the guest has been sleeping.
      
      The code makes my head spin around.
      
      Scenario 1:
      
          + guest did sleep:
      	+ PVCLOCK_GUEST_STOPPED is set
      
          + 1st watchdog_timer_fn() invocation:
      	+ the watchdog is not touched yet
      	+ is_softlockup() returns too big delay
      	+ kvm_check_and_clear_guest_paused():
      	   + clear PVCLOCK_GUEST_STOPPED
      	   + call touch_softlockup_watchdog_sync()
      		+ set SOFTLOCKUP_DELAY_REPORT
      		+ set softlockup_touch_sync
      	+ return from the timer callback
      
            + 2nd watchdog_timer_fn() invocation:
      
      	+ call sched_clock_tick() even though it is not needed.
      	  The timer callback was invoked again only because the clock
      	  has already been updated in the meantime.
      
      	+ call kvm_check_and_clear_guest_paused() that does nothing
      	  because PVCLOCK_GUEST_STOPPED has been cleared already.
      
      	+ call update_report_ts() and return. This is fine. Except
      	  that sched_clock_tick() might have allowed it to be set
      	  already during the 1st invocation.
      
      Scenario 2:
      
      	+ guest did sleep
      
      	+ 1st watchdog_timer_fn() invocation
      	    + same as in 1st scenario
      
      	+ guest did sleep again:
      	    + set PVCLOCK_GUEST_STOPPED again
      
      	+ 2nd watchdog_timer_fn() invocation
      	    + SOFTLOCKUP_DELAY_REPORT is set from 1st invocation
      	    + call sched_clock_tick()
      	    + call kvm_check_and_clear_guest_paused()
      		+ clear PVCLOCK_GUEST_STOPPED
      		+ call touch_softlockup_watchdog_sync()
      		    + set SOFTLOCKUP_DELAY_REPORT
      		    + set softlockup_touch_sync
      	    + call update_report_ts() (set real timestamp immediately)
      	    + return from the timer callback
      
      	+ 3rd watchdog_timer_fn() invocation
      	    + timestamp is set from 2nd invocation
      	    + softlockup_touch_sync is set but not checked because
      	      the real timestamp is already set
      
      Make the code more straightforward:
      
      1. Always call kvm_check_and_clear_guest_paused() at the very
         beginning to handle PVCLOCK_GUEST_STOPPED. It touches the watchdog
         when the guest did sleep.
      
      2. Handle the situation when the watchdog has been touched
         (SOFTLOCKUP_DELAY_REPORT is set).
      
         Call sched_clock_tick() when a touch_*sync() variant was used. It
         makes sure that the timestamp will be up to date even when it has
         been touched in an atomic context or the guest did sleep.
      
      As a result, kvm_check_and_clear_guest_paused() is called in a single
      location, and the right timestamp is always set when returning from the
      timer callback.
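      
      A condensed sketch of the resulting timer-callback flow is below
      (simplified: names follow kernel/watchdog.c, but per-CPU bookkeeping,
      stats and the hard-lockup side are omitted, so read it as an
      illustration rather than the exact diff):
      
        static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
        {
                unsigned long period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
      
                /* 1. Handle PVCLOCK_GUEST_STOPPED in one place, up front. */
                kvm_check_and_clear_guest_paused();
      
                hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
      
                /*
                 * 2. The watchdog has been touched: resync the local clock only
                 *    when a *_sync() variant was used, set a fresh timestamp,
                 *    and return.
                 */
                if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
                        if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
                                __this_cpu_write(softlockup_touch_sync, false);
                                sched_clock_tick();
                        }
                        update_report_ts();
                        return HRTIMER_RESTART;
                }
      
                /* Otherwise fall through to the usual is_softlockup() evaluation. */
                return HRTIMER_RESTART;
        }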
      
      Link: https://lkml.kernel.org/r/20210311122130.6788-7-pmladek@suse.com
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bf3bc94