  1. Nov 10, 2021
    • proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks · cc5f2704
      David Hildenbrand authored
      Let's support multiple registered callbacks, making sure that
      registering vmcore callbacks cannot fail.  Make the callback return a
      bool instead of an int, handling how to deal with errors internally.
      Drop unused HAVE_OLDMEM_PFN_IS_RAM.
      
      We soon want to make use of this infrastructure from other drivers:
      virtio-mem, registering one callback for each virtio-mem device, to
      prevent reading unplugged virtio-mem memory.
      
      Handle it via a generic vmcore_cb structure, prepared for future
      extensions: for example, once we support virtio-mem on s390x where the
      vmcore is completely constructed in the second kernel, we want to detect
      and add plugged virtio-mem memory ranges to the vmcore in order for them
      to get dumped properly.
      
      Handle corner cases that are unexpected and shouldn't happen in sane
      setups: registering a callback after the vmcore has already been opened
      (warn only) and unregistering a callback after the vmcore has already been
      opened (warn and essentially read only zeroes from that point on).
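
      The behaviour described above can be modelled in plain userspace C.
      This is only a sketch under stated assumptions, not the kernel code:
      it has no locking, the warning is reduced to a counter, and
      'demo_even_pfn_is_ram', 'vmcore_opened', 'vmcore_cb_unstable' and
      'vmcore_warnings' are illustrative names chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a callback registry; illustrative, not kernel code. */
struct vmcore_cb {
	bool (*pfn_is_ram)(struct vmcore_cb *cb, unsigned long pfn);
	struct vmcore_cb *next;
};

static struct vmcore_cb *vmcore_cb_list;
static bool vmcore_opened;         /* set once the dump has been opened */
static bool vmcore_cb_unstable;    /* a callback vanished after open */
static unsigned int vmcore_warnings;

/* Registration cannot fail: it only links the node into the list. */
void register_vmcore_cb(struct vmcore_cb *cb)
{
	cb->next = vmcore_cb_list;
	vmcore_cb_list = cb;
	if (vmcore_opened)
		vmcore_warnings++;     /* unexpected: warn only */
}

void unregister_vmcore_cb(struct vmcore_cb *cb)
{
	struct vmcore_cb **p;

	for (p = &vmcore_cb_list; *p; p = &(*p)->next) {
		if (*p == cb) {
			*p = cb->next;
			break;
		}
	}
	if (vmcore_opened) {
		vmcore_warnings++;
		vmcore_cb_unstable = true; /* read only zeroes from now on */
	}
}

/* A page counts as RAM only if no registered callback objects. */
bool pfn_is_ram(unsigned long pfn)
{
	struct vmcore_cb *cb;

	if (vmcore_cb_unstable)
		return false;
	for (cb = vmcore_cb_list; cb; cb = cb->next) {
		if (cb->pfn_is_ram && !cb->pfn_is_ram(cb, pfn))
			return false;
	}
	return true;
}

/* Hypothetical callback for demonstration: only even PFNs are RAM. */
bool demo_even_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
{
	(void)cb;
	return pfn % 2 == 0;
}
```

      With one such callback registered, pfn_is_ram(3) reports false while
      pfn_is_ram(2) reports true; unregistering it after the model's
      'vmcore_opened' flag is set makes every later query report false.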
      
      Link: https://lkml.kernel.org/r/20211005121430.30136-6-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc5f2704
    • proc/vmcore: let pfn_is_ram() return a bool · 2c9feeae
      David Hildenbrand authored
      The callback should deal with errors internally; it doesn't make sense
      to expose these via pfn_is_ram().  We'll rework the callbacks next.
      Right now we consider errors as if "it's RAM"; no functional change.
      
      Link: https://lkml.kernel.org/r/20211005121430.30136-5-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c9feeae
    • x86/xen: print a warning when HVMOP_get_mem_type fails · 934fadf4
      David Hildenbrand authored
      HVMOP_get_mem_type is not expected to fail, "This call failing is
      indication of something going quite wrong and it would be good to know
      about this." [1]
      
      Let's add a pr_warn_once().
      
      Link: https://lkml.kernel.org/r/3b935aa0-6d85-0bcd-100e-15098add3c4c@oracle.com [1]
      Link: https://lkml.kernel.org/r/20211005121430.30136-4-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      934fadf4
    • x86/xen: simplify xen_oldmem_pfn_is_ram() · d452a489
      David Hildenbrand authored
      Let's simplify return handling.
      
      Link: https://lkml.kernel.org/r/20211005121430.30136-3-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d452a489
    • x86/xen: update xen_oldmem_pfn_is_ram() documentation · 434b90f3
      David Hildenbrand authored
      After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
      this series tackles the last sane way a VM could accidentally
      access logically unplugged memory managed by a virtio-mem device:
      /proc/vmcore
      
      When dumping memory via "makedumpfile", PG_offline pages, used by
      virtio-mem to flag logically unplugged memory, are already properly
      excluded; however, especially when accessing/copying /proc/vmcore "the
      usual way", we can still end up reading logically unplugged memory that
      is part of a virtio-mem device.
      
      Patch #1-#3 are cleanups.  Patch #4 extends the existing
      oldmem_pfn_is_ram mechanism.  Patch #5-#7 are virtio-mem refactorings
      for patch #8, which implements the virtio-mem logic to query the state
      of device blocks.
      
      Patch #8:
       "Although virtio-mem currently supports reading unplugged memory in the
        hypervisor, this will change in the future, indicated to the device
        via a new feature flag. We similarly sanitized /proc/kcore access
        recently.
        [...]
        Distributions that support virtio-mem+kdump have to make sure that the
        virtio_mem module will be part of the kdump kernel or the kdump
        initrd; dracut was recently [2] extended to include virtio-mem in the
        generated initrd. As long as no special kdump kernels are used, this
        will automatically make sure that virtio-mem will be around in the
        kdump initrd and sanitize /proc/vmcore access -- with dracut"
      
      This is the last remaining bit to support
      VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
      virtio-mem.
      
      Note: this is best-effort.  We'll never be able to control what runs
      inside the second kernel, really, but we also don't have to care: we
      only care about sane setups where we don't want our VM getting zapped
      once we touch the wrong memory location while dumping.  While we usually
      expect sane setups to use "makedumpfile", nothing really speaks against
      just copying /proc/vmcore, especially in environments where HWpoisoning
      isn't typically expected.  Also, we really don't want to put all our
      trust completely on the memmap, so sanitizing also makes sense when just
      using "makedumpfile".
      
      [1] https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
      [2] https://github.com/dracutdevs/dracut/pull/1157
      [3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html
      
      This patch (of 9):
      
      The callback is only used for the vmcore nowadays.
      
      Link: https://lkml.kernel.org/r/20211005121430.30136-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20211005121430.30136-2-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrvsky@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      434b90f3
    • procfs: do not list TID 0 in /proc/<pid>/task · 0658a096
      Florian Weimer authored
      If a task exits concurrently, task_pid_nr_ns may return 0.
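
      The idea can be illustrated with a small userspace sketch.  It is a
      model under stated assumptions, not the procfs code: the array
      'tids' stands in for the values a per-task TID lookup returned, and
      'list_task_tids' is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Copy the TIDs that should be listed into out[] and return how many
 * were copied.  A value of 0 models a task that exited concurrently,
 * so it is skipped instead of being listed as a "0" entry.
 */
size_t list_task_tids(const int *tids, size_t n, int *out)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (tids[i] == 0)
			continue;   /* task went away; do not emit "0" */
		out[count++] = tids[i];
	}
	return count;
}
```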
      
      [akpm@linux-foundation.org: coding style tweaks]
      [adobriyan@gmail.com: test that /proc/*/task doesn't contain "0"]
        Link: https://lkml.kernel.org/r/YV88AnVzHxPafQ9o@localhost.localdomain
      
      Link: https://lkml.kernel.org/r/8735pn5dx7.fsf@oldenburg.str.redhat.com
      
      
      Signed-off-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0658a096
    • mm,hugetlb: remove mlock ulimit for SHM_HUGETLB · 83c1fd76
      zhangyiru authored
      Commit 21a3c273 ("mm, hugetlb: add thread name and pid to
      SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012,
      but it is not deleted yet.
      
      Mike says he still sees that message in log files on occasion, so maybe we
      should preserve this warning.
      
      Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the
      user_shm_unlock after out.
      
      Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com
      
      
      Signed-off-by: zhangyiru <zhangyiru3@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Liu Zixian <liuzixian4@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: wuxu.wu <wuxu.wu@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83c1fd76
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Johannes Weiner authored
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inode could fill lowmem zones
      before the cache was getting reclaimed in the highmem zones.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequence of dropping page cache prematurely is that we're seeing
      subtle and not-so-subtle failures in all of the above-mentioned
      scenarios, with the workload generally entering unexpected thrashing
      states while losing the ability to reliably detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
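
      The queueing rules described above can be sketched as a small
      userspace model.  It assumes a single thread and therefore leaves
      out all of the locking discussed in the previous paragraph;
      'struct inode_model' and the helper names are hypothetical and do
      not correspond to kernel symbols.

```c
#include <stdbool.h>

/* Simplified inode model; fields are illustrative only. */
struct inode_model {
	unsigned int refcount;    /* active users, e.g. open files */
	bool dirty;
	unsigned long nr_cached;  /* page cache entries still attached */
	bool on_lru;
};

/* Only clean, unused inodes without page cache belong on the LRU. */
bool inode_lru_eligible(const struct inode_model *inode)
{
	return inode->refcount == 0 && !inode->dirty &&
	       inode->nr_cached == 0;
}

/* Additions do not touch the LRU; the shrinker re-checks later. */
void inode_add_cache(struct inode_model *inode)
{
	inode->nr_cached++;
}

/* Queue the inode once the last cache entry disappears. */
void inode_remove_cache(struct inode_model *inode)
{
	if (inode->nr_cached > 0)
		inode->nr_cached--;
	if (inode_lru_eligible(inode))
		inode->on_lru = true;
}

/* Returns true if the inode was reclaimed. */
bool shrinker_try_reclaim(struct inode_model *inode)
{
	if (!inode->on_lru)
		return false;
	inode->on_lru = false;    /* leaves the LRU in either case */
	/* Populated or busy inodes are dropped from the LRU, not reclaimed. */
	return inode_lru_eligible(inode);
}
```

      In this model an inode that gained page cache while queued is simply
      taken off the list when the shrinker meets it, and is queued again
      only when inode_remove_cache() drops the last entry.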
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51b8c1fe
  2. Nov 07, 2021
    • mm/damon: remove return value from before_terminate callback · 658f9ae7
      Changbin Du authored
      Since the return value of 'before_terminate' callback is never used, we
      make it have no return value.
      
      Link: https://lkml.kernel.org/r/20211029005023.8895-1-changbin.du@gmail.com
      
      
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      658f9ae7
    • mm/damon: simplify stop mechanism · 0f91d133
      Changbin Du authored
      A kernel thread can exit gracefully with kthread_stop().  So we don't
      need a new flag 'kdamond_stop'.  And to make sure the task struct is not
      freed when accessing it, get a reference to it before termination.
      
      Link: https://lkml.kernel.org/r/20211027130517.4404-1-changbin.du@gmail.com
      
      
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f91d133
    • Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions · 0d16cfd4
      SeongJae Park authored
      Some descriptions of page flags in 'pagemap.rst' were written on the
      assumption of a non-rst format, which respects every new line, as below:
      
          7 - SLAB
             page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
             When compound page is used, SLUB/SLQB will only set this flag on the head
      
      Because rst ignores the new line between the first and second
      sentences, the resulting HTML looks a little bit weird, as below.
      
          7 - SLAB
          page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When
                                                                             ^
          compound page is used, SLUB/SLQB will only set this flag on the head
          page; SLOB will not flag it at all.
      
      This change makes it more natural and consistent with other parts in the
      rendered version.
      
      Link: https://lkml.kernel.org/r/20211022090311.3856-5-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d16cfd4
    • Docs/admin-guide/mm/damon/start: simplify the content · b1eee3c5
      SeongJae Park authored
      Information in 'TL; DR' section of 'Getting Started' is duplicated in
      other parts of the doc.  It is also asking readers to visit the access
      pattern visualizations gallery web site to show the results of example
      visualization commands, while the users of the commands can use terminal
      output.
      
      To make the doc simple, this removes the duplicated 'TL; DR' section and
      replaces the visualization example commands with versions using terminal
      outputs.
      
      Link: https://lkml.kernel.org/r/20211022090311.3856-4-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1eee3c5
    • Docs/admin-guide/mm/damon/start: fix a wrong link · 49ce7dee
      SeongJae Park authored
      The 'Getting Started' document of DAMON provides a link to DAMON's
      user interface document when referring to the detailed usage of its
      user space tool.  This fixes the link.
      
      Link: https://lkml.kernel.org/r/20211022090311.3856-3-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49ce7dee
    • Docs/admin-guide/mm/damon/start: fix wrong example commands · 82e3fff5
      SeongJae Park authored
      Patch series "Fix trivial nits in Documentation/admin-guide/mm".
      
      This patchset fixes trivial nits in admin guide documents for DAMON and
      pagemap.
      
      This patch (of 4):
      
      Some of the example commands in DAMON getting started guide are
      outdated, missing sudo, or just wrong.  This fixes those.
      
      Link: https://lkml.kernel.org/r/20211022090311.3856-2-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82e3fff5
    • mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on · b5ca3e83
      Xin Hao authored
      I tested the monitor_on interface while the ctx->adaptive_targets list
      was empty, like this:
      
          # cat /sys/kernel/debug/damon/target_ids
          #
          # echo on > /sys/kernel/debug/damon/monitor_on
          # damon: kdamond (5390) starts
      
      Even though the ctx->adaptive_targets list is empty, kthread_run is
      still called and the kdamond.x thread is still created, which is
      meaningless.
      
      So this adds a check in 'dbgfs_monitor_on_write': if the
      ctx->adaptive_targets list is empty, return -EINVAL.
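
      The shape of the check can be sketched in userspace C.  This is an
      illustration under stated assumptions rather than the debugfs
      handler itself: 'struct target_list_model' and
      'monitor_on_write_model' are hypothetical names, and starting the
      monitoring thread is reduced to setting a flag.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the context's list of monitoring targets. */
struct target_list_model {
	size_t nr_targets;
};

/*
 * Model of handling a write of "on": refuse to start monitoring when
 * there is nothing to monitor, otherwise mark the worker as running.
 */
int monitor_on_write_model(const struct target_list_model *targets,
			   int *kdamond_running)
{
	if (targets->nr_targets == 0)
		return -EINVAL;     /* no targets: do not start a thread */
	*kdamond_running = 1;
	return 0;
}
```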
      
      Link: https://lkml.kernel.org/r/0a60a6e8ec9d71989e0848a4dc3311996ca3b5d4.1634720326.git.xhao@linux.alibaba.com
      
      
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5ca3e83
    • mm/damon: remove unnecessary variable initialization · a460a360
      Xin Hao authored
      Patch series "mm/damon: Fix some small bugs", v4.
      
      This patch (of 2):
      
      In 'damon_va_apply_three_regions' there is no need to set variable 'i'
      to zero.
      
      Link: https://lkml.kernel.org/r/b7df8d3dad0943a37e01f60c441b1968b2b20354.1634720326.git.xhao@linux.alibaba.com
      Link: https://lkml.kernel.org/r/cover.1634720326.git.xhao@linux.alibaba.com
      
      
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a460a360
    • Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM · bec976b6
      SeongJae Park authored
      This adds an admin-guide document for DAMON-based Reclamation.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-16-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bec976b6
    • mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) · 43b0536c
      SeongJae Park authored
      This implements a new kernel subsystem that finds cold memory regions
      using DAMON and reclaims those immediately.  It is intended to be used
      as proactive lightweight reclamation logic for light memory pressure.
      For heavy memory pressure, it could be inactivated and fall back to the
      traditional page-scanning based reclamation.
      
      It's implemented on top of DAMON framework to use the DAMON-based
      Operation Schemes (DAMOS) feature.  It utilizes all the DAMOS features
      including speed limit, prioritization, and watermarks.
      
      It can be enabled and tuned at boot time via kernel boot
      parameters, and at run time via its module parameters
      ('/sys/module/damon_reclaim/parameters/') interface.
      
      [yangyingliang@huawei.com: fix error return code in damon_reclaim_turn()]
        Link: https://lkml.kernel.org/r/20211025124500.2758060-1-yangyingliang@huawei.com
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-15-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b0536c
    • selftests/damon: support watermarks · 1dc90ccd
      SeongJae Park authored
      This updates DAMON selftests for 'schemes' debugfs file to reflect the
      changes in the format.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-14-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1dc90ccd
    • mm/damon/dbgfs: support watermarks · ae666a6d
      SeongJae Park authored
      This updates DAMON debugfs interface to support the watermarks based
      schemes activation.  For this, now 'schemes' file receives five more
      values.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-13-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae666a6d
    • mm/damon/schemes: activate schemes based on a watermarks mechanism · ee801b7d
      SeongJae Park authored
      DAMON-based operation schemes need to be manually turned on and off.  In
      some use cases, however, the condition for turning a scheme on and off
      would depend on the system's situation.  For example, schemes for
      proactive pages reclamation would need to be turned on when some memory
      pressure is detected, and turned off when the system has enough free
      memory.
      
      For easier control of schemes activation based on the system situation,
      this introduces a watermarks-based mechanism.  The client can describe
      the watermark metric (e.g., amount of free memory in the system),
      watermark check interval, and three watermarks, namely high, mid, and
      low.  While the scheme is deactivated, it only reads the metric and
      compares it to the three watermarks at every check interval.  If the
      metric is higher than the high watermark, the scheme is deactivated.
      If the metric is between the mid watermark and the low watermark, the
      scheme is activated.  If the metric is lower than the low watermark,
      the scheme is deactivated again.  This is to allow users to fall back
      to traditional page-granularity mechanisms.
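
      The three-watermark rule can be written down as a small state
      update.  The following is a userspace sketch under stated
      assumptions, not the DAMON implementation: 'struct wmarks_model'
      and 'wmarks_update' are hypothetical names, the state is left
      unchanged while the metric sits between the mid and high
      watermarks, and the handling of values exactly equal to a
      watermark is a choice made for this example.

```c
#include <stdbool.h>

/* Illustrative watermark state for one scheme. */
struct wmarks_model {
	unsigned long high;
	unsigned long mid;
	unsigned long low;
	bool activated;
};

/*
 * Apply one check of the metric (e.g. amount of free memory) and
 * return whether the scheme is active afterwards.
 */
bool wmarks_update(struct wmarks_model *w, unsigned long metric)
{
	if (metric > w->high || metric < w->low)
		w->activated = false;   /* plenty of memory, or too little */
	else if (metric < w->mid)
		w->activated = true;    /* between low and mid: activate */
	/* between mid and high: keep the current state */
	return w->activated;
}
```

      With high = 500, mid = 400 and low = 200, a metric of 450 leaves an
      inactive scheme inactive, 300 activates it, a later 450 keeps it
      active, and 100 deactivates it again.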
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-12-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee801b7d
    • SeongJae Park's avatar
      tools/selftests/damon: update for regions prioritization of schemes · 5a0d6a08
      SeongJae Park authored
      This updates the DAMON selftests for 'schemes' debugfs file, as the file
      format is updated.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-11-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a0d6a08
    • SeongJae Park's avatar
      mm/damon/dbgfs: support prioritization weights · f4a68b4a
      SeongJae Park authored
      This allows DAMON debugfs interface users to set the prioritization
      weights by writing three more numbers to the 'schemes' file.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-10-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4a68b4a
    • SeongJae Park's avatar
      mm/damon/vaddr,paddr: support pageout prioritization · 198f0f4c
      SeongJae Park authored
      This makes the default monitoring primitives for virtual address spaces
      and the physical address space support memory region prioritization
      for the 'PAGEOUT' DAMOS action.  It calculates the hotness of each region
      as a weighted sum of the 'nr_accesses' and 'age' of the region, and gets
      the priority score as the reverse of the hotness, so that cold regions
      can be paged out first.
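
      A minimal C sketch of such a score calculation is shown below.  It is
      not the code of this patch: the function names and MAX_SCORE are
      hypothetical, and it assumes that both inputs were already normalized
      to 0..MAX_SCORE with larger values meaning hotter.  How the raw
      'nr_accesses' and 'age' values are normalized is left out.

```c
#define MAX_SCORE 100u

/*
 * Weighted hotness of a region, in 0..MAX_SCORE.
 * Assumption: freq_score and age_score are already normalized to
 * 0..MAX_SCORE, with larger values meaning hotter.
 */
unsigned int hotness_score(unsigned int freq_score, unsigned int age_score,
			   unsigned int freq_weight, unsigned int age_weight)
{
	unsigned int total = freq_weight + age_weight;

	if (total == 0)
		return 0;	/* no weights given: treat the region as cold */
	return (freq_weight * freq_score + age_weight * age_score) / total;
}

/* Pageout priority is the reverse of the hotness: colder means higher. */
unsigned int pageout_priority(unsigned int freq_score, unsigned int age_score,
			      unsigned int freq_weight, unsigned int age_weight)
{
	return MAX_SCORE - hotness_score(freq_score, age_score,
					 freq_weight, age_weight);
}
```

      With equal weights, a region with subscores 0 and 10 gets priority 95,
      while a region with subscores 80 and 50 gets priority 35, so the former
      would be paged out first.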
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-9-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      198f0f4c
    • SeongJae Park's avatar
      mm/damon/schemes: prioritize regions within the quotas · 38683e00
      SeongJae Park authored
      This makes DAMON apply schemes to regions having higher priority first,
      if it cannot apply schemes to all regions due to the quotas.
      
      The prioritization function should be implemented in the monitoring
      primitives.  Those would commonly calculate the priority of the region
      using attributes of regions, namely 'size', 'nr_accesses', and 'age'.
      For example, some primitive would calculate the priority of each region
      using a weighted sum of 'nr_accesses' and 'age' of the region.
      
      The optimal weights would depend on the given environment, so this makes
      them customizable.  Nevertheless, the score calculation functions are
      only encouraged to respect the weights, not mandated.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-8-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38683e00
    • SeongJae Park's avatar
      mm/damon/selftests: support schemes quotas · a2cb4dd0
      SeongJae Park authored
      This updates DAMON selftests to support updated schemes debugfs file
      format for the quotas.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-7-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2cb4dd0
    • SeongJae Park's avatar
      mm/damon/dbgfs: support quotas of schemes · d7d0ec85
      SeongJae Park authored
      This makes the debugfs interface of DAMON support the scheme quotas by
      changing the format of the input for the schemes file.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-6-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7d0ec85
    • SeongJae Park's avatar
      mm/damon/schemes: implement time quota · 1cd24303
      SeongJae Park authored
      The size quota feature of DAMOS is useful for IO resource-critical
      systems, but not so intuitive for CPU time-critical systems.  Systems
      using zram or zswap-like swap devices would be examples.
      
      To provide another intuitive way for such systems, this implements a
      time-based quota for DAMON-based Operation Schemes.  If the quota is
      set, DAMOS tries to use only up to the user-defined quota of CPU time
      within a given time window.
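
      One simple way to picture a time-based quota is the following C
      sketch.  It is not the implementation of this patch: the struct and
      function names are hypothetical, and it assumes the caller measures
      how long each action took and reports it afterwards.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for illustration only. */
struct time_quota_sketch {
	unsigned long budget_ms;	/* CPU time allowed per window; 0: no limit */
	unsigned long window_ms;	/* length of one time window */
	unsigned long window_start_ms;	/* when the current window began */
	unsigned long used_ms;		/* time charged in the current window */
};

/* Start a new window if needed, then tell whether more work may run. */
bool time_quota_may_run(struct time_quota_sketch *q, unsigned long now_ms)
{
	if (now_ms - q->window_start_ms >= q->window_ms) {
		q->window_start_ms = now_ms;
		q->used_ms = 0;
	}
	return q->budget_ms == 0 || q->used_ms < q->budget_ms;
}

/* Charge the time that an already finished action has consumed. */
void time_quota_charge(struct time_quota_sketch *q, unsigned long spent_ms)
{
	q->used_ms += spent_ms;
}
```

      Because the time is charged only after an action has finished, such a
      scheme can overshoot the budget by the duration of the last action,
      which matches the "tries to" wording above.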
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cd24303
    • SeongJae Park's avatar
      mm/damon/schemes: skip already charged targets and regions · 50585192
      SeongJae Park authored
      If DAMOS has stopped applying action in the middle of a group of memory
      regions due to its size quota, it starts the work again from the
      beginning of the address space in the next charge window.  If there is a
      huge memory region at the beginning of the address space and it always
      fulfills the scheme's target data access pattern, the action will be
      applied only to that region.
      
      This mitigates the case by skipping, at the beginning of the next charge
      window, memory regions that were already charged in the current charge
      window.
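
      The idea can be sketched in C as remembering the address at which the
      previous charge window stopped and resuming from there.  The names
      below are hypothetical and the sketch is a simplification: it resumes
      at the region containing the remembered address rather than splitting
      that region, and it assumes the regions are sorted by address.

```c
#include <stddef.h>

/* Hypothetical region descriptor; 'end' is exclusive. */
struct region_sketch {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the index of the first region to handle in a new charge window,
 * given the address where the previous window stopped charging.
 * Assumption: if every region was already charged, wrap around to 0.
 */
size_t resume_index(const struct region_sketch *regions, size_t nr_regions,
		    unsigned long resume_addr)
{
	size_t i;

	for (i = 0; i < nr_regions; i++) {
		if (regions[i].end > resume_addr)
			return i;
	}
	return 0;
}
```

      With regions [0, 100), [100, 200) and [200, 300), a previous window
      that stopped at address 100 makes the next window start from the
      second region instead of the first one.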
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50585192
    • SeongJae Park's avatar
      mm/damon/schemes: implement size quota for schemes application speed control · 2b8a248d
      SeongJae Park authored
      There could be arbitrarily large memory regions fulfilling the target
      data access pattern of a DAMON-based operation scheme.  In that case,
      applying the action of the scheme could incur too high an overhead.  To
      provide an intuitive way of avoiding it, this implements a feature
      called size quota.  If the quota is set, DAMON tries to apply the action
      only up to the given amount of memory regions within a given time
      window.
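
      As a rough illustration, a size quota can be modeled as a byte budget
      that is refilled at the start of each time window.  The C sketch below
      is not the code of this patch; the struct, field, and function names
      are hypothetical.

```c
/* Hypothetical bookkeeping for illustration only. */
struct size_quota_sketch {
	unsigned long sz;		/* bytes allowed per window; 0: no limit */
	unsigned long reset_interval_ms;/* length of one charge window */
	unsigned long charged_sz;	/* bytes charged in the current window */
	unsigned long window_start_ms;	/* when the current window began */
};

/*
 * Charge a region of 'region_sz' bytes against the quota and return how
 * many of its bytes the action may be applied to right now.
 */
unsigned long size_quota_charge(struct size_quota_sketch *q,
				unsigned long now_ms, unsigned long region_sz)
{
	unsigned long room;

	if (q->sz == 0)
		return region_sz;
	if (now_ms - q->window_start_ms >= q->reset_interval_ms) {
		q->window_start_ms = now_ms;
		q->charged_sz = 0;
	}
	room = q->sz - q->charged_sz;	/* charged_sz never exceeds sz */
	if (region_sz > room)
		region_sz = room;
	q->charged_sz += region_sz;
	return region_sz;
}
```

      With a quota of 100 bytes per second, two 60-byte regions handled in
      the same window would be allowed 60 and 40 bytes, and a third one
      would get nothing until the next window begins.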
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b8a248d
    • SeongJae Park's avatar
      mm/damon/paddr: support the pageout scheme · 57223ac2
      SeongJae Park authored
      Introduction
      ============
      
      This patchset 1) makes the engine for general data access
      pattern-oriented memory management (DAMOS) more useful for production
      environments, and 2) implements a static kernel module for lightweight
      proactive reclamation using the engine.
      
      Proactive Reclamation
      ---------------------
      
      On general memory over-committed systems, proactively reclaiming cold
      pages helps save memory and reduce latency spikes that are incurred by
      direct reclaim or the CPU consumption of kswapd, while incurring
      only minimal performance degradation[2].
      
      A Free Pages Reporting[8] based memory over-commit virtualization system
      would be one more specific use case.  In such a system, the guest VMs
      report their free memory to the host, and the host reallocates the
      reported memory to other guests.  As a result, the system's memory
      utilization can be maximized.  However, the guests might not be so
      memory-frugal, because some kernel subsystems and user-space applications
      are designed to use as much memory as is available.  Then, guests would
      report only a small amount of free memory to the host, resulting in poor
      memory utilization.  Running the proactive reclamation in such guests
      could help mitigate this problem.
      
      Google has also implemented this idea and is using it in their data
      center.  They further proposed upstreaming it at LSFMM'19, and "the
      general consensus was that, while this sort of proactive reclaim would be
      useful for a number of users, the cost of this particular solution was
      too high to consider merging it upstream"[3].  The cost mainly comes from
      the coldness tracking.  Roughly speaking, the implementation periodically
      scans the 'Accessed' bit of each page.  For this reason, the overhead
      increases linearly as the size of the memory and the scanning frequency
      grow.  As a result, Google is known to dedicate one CPU to the work.
      That's a reasonable option for someone like Google, but it wouldn't be so
      for some others.
      
      DAMON and DAMOS: An engine for data access pattern-oriented memory management
      -----------------------------------------------------------------------------
      
      DAMON[4] is a framework for general data access monitoring.  Its
      adaptive monitoring overhead control feature minimizes its monitoring
      overhead.  It also lets the upper bound of the overhead be configurable
      by clients, regardless of the size of the monitoring target memory.
      While monitoring 70 GiB memory of a production system every 5
      milliseconds, it consumes less than 1% of single CPU time.  For this, it
      could sacrifice some of the quality of the monitoring results.
      Nevertheless, the lower bound of the quality is configurable, and it
      uses a best-effort algorithm for better quality.  Our test results[5]
      show the quality is practical enough.  From the production system
      monitoring, we were able to find a 4 KiB region in the 70 GiB memory
      that shows the highest access frequency.
      
      We normally don't monitor the data access pattern just for fun but to
      improve something like memory management.  Proactive reclamation is one
      such usage.  For such general cases, DAMON provides a feature called
      DAMON-based Operation Schemes (DAMOS)[6].  It makes DAMON an engine for
      general data access pattern-oriented memory management.  Using this,
      clients can ask DAMON to find memory regions of a specific data access
      pattern and apply some memory management action (e.g., page out, move to
      head of the LRU list, use huge page, ...).  We call the request a
      'scheme'.
      
      Proactive Reclamation on top of DAMON/DAMOS
      -------------------------------------------
      
      Therefore, by using DAMON for the cold pages detection, the proactive
      reclamation's monitoring overhead issue can be solved.  Actually, we
      previously implemented a version of proactive reclamation using DAMOS
      and achieved noticeable improvements with our evaluation setup[5].
      Nevertheless, it is more of a proof-of-concept than something for
      production use.  It supports only virtual address spaces of processes,
      and requires additional tuning efforts for given workloads and the
      hardware.  For the tuning, we introduced a simple auto-tuning user space
      tool[7].  Google is also known to use a similar ML-based approach for
      their fleets[2].  But making it just work with intuitive knobs in the
      kernel would be helpful for general users.
      
      To this end, this patchset improves DAMOS to be ready for such
      production usages, and implements another version of the proactive
      reclamation, namely DAMON_RECLAIM, on top of it.
      
      DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks
      --------------------------------------------------------------------------
      
      First of all, the current version of DAMOS supports only virtual address
      spaces.  This patchset makes it support the physical address space for
      the page out action.
      
      The next major problem of the current version of DAMOS is the lack of
      aggressiveness control, which can result in arbitrary overhead.  For
      example, if huge memory regions having the data access pattern of
      interest are found, applying the requested action to all of the regions
      could incur significant overhead.  It can be controlled by tuning the
      target data access pattern with manual or automated approaches[2,7].
      But, some people would prefer the kernel to just work with only
      intuitive tuning or default values.
      
      For such cases, this patchset implements a safeguard, namely time/size
      quota.  Using this, the clients can specify up to how much time can be
      used for applying the action, and/or up to how much memory the action
      can be applied to within a user-specified time duration.  A followup
      question is, to which memory regions should the action be applied within
      the limits? We implement a simple regions prioritization mechanism for
      each action and make DAMOS apply the action to high priority regions
      first.  It also allows clients to tune the prioritization mechanism to
      use different weights for size, access frequency, and age of memory
      regions.  This means we could use not only LRU but also LFU or some fancy
      algorithms like CAR[9] with lightweight overhead.
      
      Though DAMON is lightweight, someone would want to remove even the cold
      pages monitoring overhead when it is unnecessary.  Currently, it should
      be manually turned on and off by clients, but some clients would simply
      want to turn it on and off based on some metrics like free memory ratio
      or memory fragmentation.  For such cases, this patchset implements a
      watermarks-based automatic activation feature.  It allows the clients to
      configure the metric of their interest, and three watermarks of the
      metric.  If the metric is higher than the high watermark or lower than
      the low watermark, the scheme is deactivated.  If the metric is lower
      than the mid watermark but higher than the low watermark, the scheme is
      activated.
      
      DAMON-based Reclaim
      -------------------
      
      Using the improved version of DAMOS, this patchset implements a static
      kernel module called 'damon_reclaim'.  It finds memory regions that
      have not been accessed for a specific time duration and pages them out.
      Consuming too much CPU for the paging out operations, or doing pageout
      too frequently, can be critical for systems configuring their swap
      devices with software-defined in-memory block devices like zram/zswap or
      devices with a limited total number of writes like SSDs, respectively.
      To avoid the problems, the time/size quotas can be configured.  Under the
      quotas, it pages out memory regions that have not been accessed for
      longer first.  Also, to remove the monitoring overhead under peaceful
      situations, and to fall back to the LRU-list based page granularity
      reclamation when it doesn't make progress, the three watermarks based
      activation mechanism is used, with the free memory ratio as the
      watermark metric.
      
      For convenient configurations, it provides several module parameters.
      Using these, sysadmins can enable/disable it, and tune its parameters
      including the coldness identification time threshold, the time/size
      quotas and the three watermarks.
      
      Evaluation
      ==========
      
      In short, DAMON_RECLAIM with 50ms/s time quota and regions
      prioritization on v5.15-rc5 Linux kernel with ZRAM swap device achieves
      38.58% memory saving with only 1.94% runtime overhead.  For this,
      DAMON_RECLAIM consumes only 4.97% of single CPU time.
      
      Setup
      -----
      
      We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements
      takes effect.  For this, we measure DAMON_RECLAIM's CPU consumption,
      entire system memory footprint, total number of major page faults, and
      runtime of 24 realistic workloads in PARSEC3 and SPLASH-2X benchmark
      suites on my QEMU/KVM based virtual machine.  The virtual machine runs
      on an i3.metal AWS instance, has 130GiB memory, and runs a Linux kernel
      built on the latest -mm tree[1] plus this patchset.  It also utilizes a 4
      GiB ZRAM swap device.  We repeat the measurement 5 times and use
      averages.
      
      [1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55
      
      Detailed Results
      ----------------
      
      The results are summarized in the table below.
      
      With a coldness identification threshold of 5 seconds, DAMON_RECLAIM
      without the time quota-based speed limit achieves 47.21% memory saving,
      but incurs a 4.59% runtime slowdown to the workloads on average.  For
      this, DAMON_RECLAIM consumes about 11.28% of single CPU time.
      
      Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions
      prioritization reduces the slowdown to 4.89%, 2.65%, and 1.5%,
      respectively.  Time quota of 200ms/s (20%) makes no real change compared
      to the quota unapplied version, because the quota unapplied version
      consumes only 11.28% CPU time.  DAMON_RECLAIM's CPU utilization is also
      similarly reduced: 11.24%, 5.51%, and 2.01% of single CPU time.  That
      is, the overhead is proportional to the speed limit.  Nevertheless, it
      also reduces the memory saving because it becomes less aggressive.  In
      detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving,
      respectively.
      
      Applying the regions prioritization (page out regions that have not been
      accessed for longer first within the time quota) further reduces the
      performance degradation.  Runtime slowdowns and the increase in the total
      number of major page faults changed as 4.89%/218,690% -> 4.39%/166,136%
      (200ms/s), 2.65%/111,886% -> 1.94%/59,053% (50ms/s), and
      1.5%/34,973.40% -> 2.08%/8,781.75% (10ms/s).  The runtime under 10ms/s
      time quota has increased with prioritization, but apparently that's
      within the margin of error.
      
          time quota   prioritization  memory_saving  cpu_util  slowdown  pgmajfaults overhead
          N            N               47.21%         11.28%    4.59%     194,802%
          200ms/s      N               48.76%         11.24%    4.89%     218,690%
          50ms/s       N               37.83%         5.51%     2.65%     111,886%
          10ms/s       N               7.85%          2.01%     1.5%      34,793.40%
          200ms/s      Y               50.08%         10.38%    4.39%     166,136%
          50ms/s       Y               38.58%         4.97%     1.94%     59,053%
          10ms/s       Y               3.63%          1.73%     2.08%     8,781.75%
      
      Baseline and Complete Git Trees
      ===============================
      
      The patches are based on the latest -mm tree
      (v5.15-rc5-mmots-2021-10-13-19-55).  You can also clone the complete git tree
      from:
      
          $ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1
      
      A web view is also available:
      https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1
      
      Sequence Of Patches
      ===================
      
      The first patch makes DAMOS support the physical address space for the
      page out action.  The following five patches (patches 2-6) implement the
      time/size quotas.  The next four patches (patches 7-10) implement the
      memory regions prioritization within the limit.  Then, the three
      following patches (patches 11-13) implement the watermarks-based schemes
      activation.
      
      Finally, the last two patches (patches 14-15) implement and document the
      DAMON-based reclamation using the advanced DAMOS.
      
      [1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html
      [2] https://research.google/pubs/pub48551/
      [3] https://lwn.net/Articles/787611/
      [4] https://damonitor.github.io
      [5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
      [6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/
      [7] https://github.com/awslabs/damoos
      [8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
      [9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement
      
      This patch (of 15):
      
      This makes the DAMON primitives for physical address space support the
      pageout action for DAMON-based Operation Schemes.  With this commit,
      hence, users can easily implement system-level data access-aware
      reclamations using DAMOS.
      
      [sj@kernel.org: fix missing-prototype build warning]
        Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57223ac2
    • Rongwei Wang's avatar
      mm/damon/dbgfs: remove unnecessary variables · 9210622a
      Rongwei Wang authored
      In some functions, it's unnecessary to declare 'err' and 'ret' variables
      at the same time.  This patch simplifies such declarations by reusing
      one variable.
      
      Link: https://lkml.kernel.org/r/20211014073014.35754-1-sj@kernel.org
      
      
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9210622a
    • Rikard Falkeborn's avatar
      mm/damon/vaddr: constify static mm_walk_ops · 199b50f4
      Rikard Falkeborn authored
      The only usage of these structs is to pass their addresses to
      walk_page_range(), which takes a pointer to const mm_walk_ops as
      argument.  Make them const to allow the compiler to put them in
      read-only memory.
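
      The pattern can be illustrated with a small user-space C analog.  This
      is not kernel code: 'walk_ops_sketch' and 'walk_range_sketch' are
      hypothetical stand-ins for mm_walk_ops and walk_page_range(), and only
      show how a static const table of callbacks is passed by pointer to
      const.

```c
/* Hypothetical stand-in for a table of walk callbacks. */
struct walk_ops_sketch {
	int (*entry)(unsigned long addr, void *private_data);
};

/* Hypothetical walker; it only reads 'ops', so it takes a pointer to const. */
int walk_range_sketch(unsigned long start, unsigned long end,
		      unsigned long step, const struct walk_ops_sketch *ops,
		      void *private_data)
{
	unsigned long addr;
	int ret;

	if (step == 0)
		return -1;	/* avoid an endless loop */
	for (addr = start; addr < end; addr += step) {
		ret = ops->entry(addr, private_data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback that counts how many addresses were visited. */
static int count_entry(unsigned long addr, void *private_data)
{
	(void)addr;
	++*(unsigned long *)private_data;
	return 0;
}

/* The table is never modified, so it can be static const. */
static const struct walk_ops_sketch count_ops = {
	.entry = count_entry,
};

unsigned long count_steps(unsigned long start, unsigned long end,
			  unsigned long step)
{
	unsigned long n = 0;

	walk_range_sketch(start, end, step, &count_ops, &n);
	return n;
}
```

      Since the walker never writes through 'ops', declaring the table const
      costs nothing at the call site.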
      
      Link: https://lkml.kernel.org/r/20211014075042.17174-2-rikard.falkeborn@gmail.com
      
      
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      199b50f4
    • SeongJae Park's avatar
      Docs/DAMON: document physical memory monitoring support · c6380721
      SeongJae Park authored
      This updates the DAMON documents for the physical memory address space
      monitoring support.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6380721
    • SeongJae Park's avatar
      mm/damon/dbgfs: support physical memory monitoring · c026291a
      SeongJae Park authored
      This makes 'damon-dbgfs' support physical memory monitoring, in
      addition to virtual memory monitoring.
      
      Users can do the physical memory monitoring by writing a special
      keyword, 'paddr' to the 'target_ids' debugfs file.  Then, DAMON will
      check the special keyword and configure the monitoring context to run
      with the primitives for the physical address space.
      
      Unlike the virtual memory monitoring, the monitoring target region will
      not be automatically set.  Therefore, users should also set the
      monitoring target address region using the 'init_regions' debugfs file.
      
      Also, note that the physical memory monitoring will not be
      automatically terminated.  The user should explicitly turn off the
      monitoring by writing 'off' to the 'monitor_on' debugfs file.
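      
      The steps above can be sketched as a small helper.  This is only an
      illustration under stated assumptions, not part of the commit: the
      function names, the debugfs mount path, the 'on' keyword, and the
      "<target id> <start> <end>" line format for 'init_regions' are
      assumptions here, so check the DAMON usage document for the kernel
      in use before relying on them.

```python
from pathlib import Path

# Assumed location of the DAMON debugfs directory (depends on the mount point).
DAMON_DBGFS = Path("/sys/kernel/debug/damon")


def start_paddr_monitoring(target_id, start, end, damon_dir=DAMON_DBGFS):
    """Configure and start physical address space monitoring (hypothetical helper)."""
    damon_dir = Path(damon_dir)
    # The special keyword selects the physical address space primitives.
    (damon_dir / "target_ids").write_text("paddr\n")
    # The target region is not set automatically, so set it explicitly.
    # Assumed line format: "<target id> <start address> <end address>".
    (damon_dir / "init_regions").write_text(f"{target_id} {start} {end}\n")
    # Assumed keyword for turning the monitoring on.
    (damon_dir / "monitor_on").write_text("on\n")


def stop_monitoring(damon_dir=DAMON_DBGFS):
    """Turn the monitoring off; it does not terminate automatically."""
    (Path(damon_dir) / "monitor_on").write_text("off\n")
```

      Taking the directory as a parameter keeps the sketch usable against
      a scratch directory, so the write sequence can be checked without
      root privileges or a DAMON-enabled kernel.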
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rienjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SeongJae Park's avatar
      mm/damon: implement primitives for physical address space monitoring · a28397be
      SeongJae Park authored
      This implements the monitoring primitives for the physical memory
      address space.  Internally, it uses the PTE Accessed bit, similar to
      the virtual address space monitoring primitives.  It supports only
      user memory pages, as idle page tracking does.  If the monitoring
      target physical memory address range contains non-user memory pages,
      the access check does nothing for those pages and simply treats them
      as not accessed.
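      
      The described behavior can be pictured with a toy model.  This is a
      user-space sketch, not the kernel code: the 'Page' class and the
      'mkold' and 'young' helpers are hypothetical names, and a plain
      boolean stands in for the PTE Accessed bit.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for a physical page (hypothetical, not a kernel type)."""
    is_user: bool           # only user memory pages are supported
    accessed: bool = False  # stands in for the PTE Accessed bit


def mkold(page):
    """Prepare an access check by clearing the Accessed bit of a user page."""
    if page.is_user:
        page.accessed = False


def young(page):
    """Report whether a user page was accessed since the last mkold()."""
    if not page.is_user:
        # Non-user pages: do nothing and treat the page as not accessed.
        return False
    return page.accessed
```

      In this model, a non-user page always reads as idle, which matches
      the commit's statement that such pages are treated as not accessed.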
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rienjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SeongJae Park's avatar
      mm/damon/vaddr: separate commonly usable functions · 46c3a0ac
      SeongJae Park authored
      This moves functions in the default virtual address space monitoring
      primitives that are commonly usable from other address spaces, such
      as the physical address space, into a header file.  Those will be
      reused by the physical address space monitoring primitives, which
      will be implemented by the following commit.
      
      [sj@kernel.org: include 'highmem.h' to fix a build failure]
        Link: https://lkml.kernel.org/r/20211014110848.5204-1-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-5-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rienjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon: document 'init_regions' feature · c2fe4987
      SeongJae Park authored
      This adds a description of the 'init_regions' feature to the DAMON
      usage document.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-4-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rienjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SeongJae Park's avatar
      mm/damon/dbgfs-test: add a unit test case for 'init_regions' · 1c2e11bf
      SeongJae Park authored
      This adds another test case for the new feature, 'init_regions'.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-3-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rienjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>