  1. Nov 07, 2021
    • SeongJae Park's avatar
      mm/damon/core: implement DAMON-based Operation Schemes (DAMOS) · 1f366e42
      SeongJae Park authored
      
      
      In many cases, users might use DAMON for simple data access aware memory
      management optimizations, such as applying an operation scheme to a
      memory region of a specific size having a specific access frequency for
      a specific time.  For example, "page out a memory region larger than 100
      MiB that has shown only a low access frequency for more than 10
      minutes", or "Use THP for a memory region larger than 2 MiB having a
      high access frequency for more than 2 seconds".
      
      The simplest form of a solution would be offline data access pattern
      profiling using DAMON and modifying the application source code or
      system configuration based on the profiling results.  Alternatively, one
      could develop a daemon built from two modules: one for access monitoring
      and the other for applying memory management actions via mlock(),
      madvise(), sysctl, etc.
      
      To save users from spending their time implementing such simple data
      access monitoring-based operation schemes, this makes DAMON handle such
      schemes directly.  With this change, users can simply specify their
      desired schemes to DAMON, and DAMON will automatically apply the schemes
      to the user-specified target processes.
      
      Each scheme is composed of conditions for filtering the target memory
      regions and the desired memory management action for the matching
      regions.  Specifically, the format is::
      
          <min/max size> <min/max access frequency> <min/max age> <action>
      
      The filtering conditions are the size of the memory region, the number
      of accesses to the region as monitored by DAMON, and the age of the
      region.  The age of a region is incremented periodically but reset when
      its addresses or access frequency change significantly, or when the
      action of a scheme has been applied to it.  For the action, the current
      implementation supports a few madvise()-like hints: ``WILLNEED``,
      ``COLD``, ``PAGEOUT``, ``HUGEPAGE``, and ``NOHUGEPAGE``.
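
      As a rough illustration, the information carried by one scheme can be
      pictured with the C sketch below; the type and field names here are
      assumptions made for this description, not necessarily the identifiers
      introduced by the patch::

          /* Illustrative sketch only; names are assumptions, not the API. */
          enum damos_action_sketch {
                  ACTION_WILLNEED,
                  ACTION_COLD,
                  ACTION_PAGEOUT,
                  ACTION_HUGEPAGE,
                  ACTION_NOHUGEPAGE,
          };

          struct damos_scheme_sketch {
                  unsigned long min_sz, max_sz;   /* region size bounds (bytes) */
                  unsigned int min_nr_accesses;   /* access frequency bounds */
                  unsigned int max_nr_accesses;
                  unsigned int min_age, max_age;  /* age bounds (aggregation steps) */
                  enum damos_action_sketch action;  /* madvise()-like hint to apply */
          };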
      
      Because DAMON supports various address spaces, and applying an action
      to a monitoring target region depends on the type of the target address
      space, the action-application code should be implemented by each set of
      primitives and registered with the framework.  Note that this commit
      implements only the framework part.  The following commit will implement
      action application for the virtual address space primitives.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f366e42
    • SeongJae Park's avatar
      mm/damon/core: account age of target regions · fda504fa
      SeongJae Park authored
      
      
      Patch series "Implement Data Access Monitoring-based Memory Operation Schemes".
      
      Introduction
      ============
      
      DAMON[1] can be used as a primitive for data access aware memory
      management optimizations.  For that, users who want such optimizations
      should run DAMON, read the monitoring results, analyze them, plan a new
      memory management scheme, and apply the new scheme by themselves.  Such
      effort will be inevitable for some complicated optimizations.
      
      However, in many other cases, the users would simply want the system to
      apply a memory management action to a memory region of a specific size
      having a specific access frequency for a specific time.  For example,
      "page out a memory region larger than 100 MiB that has received only
      rare accesses for more than 2 minutes", or "Do not use THP for a memory
      region larger than 2 MiB rarely accessed for more than 1 second".
      
      To make this work easier and non-redundant, this patchset implements a
      new feature of DAMON, called Data Access Monitoring-based Operation
      Schemes (DAMOS).  Using the feature, users can describe such common
      schemes in a simple way and ask DAMON to execute them on its own.
      
      [1] https://damonitor.github.io
      
      Evaluations
      ===========
      
      DAMOS is accurate and useful for memory management optimizations.  An
      experimental DAMON-based operation scheme for THP, 'ethp', removes
      76.15% of THP memory overheads while preserving 51.25% of THP speedup.
      Another experimental DAMON-based 'proactive reclamation' implementation,
      'prcl', reduces 93.38% of resident sets and 23.63% of system memory
      footprint while incurring only 1.22% runtime overhead in the best case
      (parsec3/freqmine).
      
      NOTE that the experimental THP optimization and proactive reclamation
      are not for production use but only proofs of concept.
      
      Please refer to the showcase web site's evaluation document[1] for
      detailed evaluation setup and results.
      
      [1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html
      
      Long-term Support Trees
      -----------------------
      
      For people who want to test DAMON but use LTS kernels, there are two
      additional trees, each based on one of the two latest LTS kernels and
      containing the 'damon/master' backports.
      
      - For v5.4.y: https://git.kernel.org/sj/h/damon/for-v5.4.y
      - For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y
      
      Sequence Of Patches
      ===================
      
      The 1st patch accounts the age of each region.  The 2nd patch
      implements the core of the DAMON-based operation schemes feature.  The
      3rd patch makes the default monitoring primitives for virtual address
      spaces support the schemes.  From this point, kernel space users can
      use DAMOS.  The 4th patch exports the feature to user space via the
      debugfs interface.  The 5th patch implements a scheme statistics
      feature for easier tuning of the schemes and runtime access pattern
      analysis, and the 6th patch adds selftests for these changes.  Finally,
      the 7th patch documents this new feature.
      
      This patch (of 7):
      
      DAMON can be used for data access pattern aware memory management
      optimizations.  For that, users should run DAMON, read the monitoring
      results, analyze them, plan a new memory management scheme, and apply
      the new scheme by themselves.  This would not be too hard, but it still
      requires some level of effort.  For complicated cases, this effort is
      inevitable.
      
      That said, in many cases, users would simply want to apply an action to
      a memory region of a specific size having a specific access frequency
      for a specific time.  For example, "page out a memory region larger
      than 100 MiB that has shown only a low access frequency for more than
      10 minutes", or "Use THP for a memory region larger than 2 MiB having a
      high access frequency for more than 2 seconds".
      
      For such optimizations, users would first need to account the age of
      each region themselves.  To reduce that effort, this implements simple
      age accounting for each region in DAMON.  At each aggregation step,
      DAMON compares a region's access frequency with that from the last
      aggregation and resets the region's age if the change is significant;
      otherwise, the age is incremented.  Also, when regions are merged, the
      size-weighted average of their ages is set as the age of the merged
      region.
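
      A minimal C sketch of the idea (structure and helper names are
      illustrative assumptions, not the patch's exact code)::

          struct region_sketch {
                  unsigned long start, end;       /* address range of the region */
                  unsigned int nr_accesses;       /* frequency in this aggregation */
                  unsigned int last_nr_accesses;  /* frequency in the previous one */
                  unsigned int age;               /* aggregation steps seen stable */
          };

          /* Called once per aggregation step. */
          static void update_age(struct region_sketch *r, unsigned int thres)
          {
                  unsigned int diff = r->nr_accesses > r->last_nr_accesses ?
                          r->nr_accesses - r->last_nr_accesses :
                          r->last_nr_accesses - r->nr_accesses;

                  if (diff > thres)
                          r->age = 0;     /* pattern changed significantly */
                  else
                          r->age++;       /* pattern stayed stable */
                  r->last_nr_accesses = r->nr_accesses;
          }

          /* On merge, use the size-weighted average of the two ages. */
          static unsigned int merged_age(struct region_sketch *a,
                                         struct region_sketch *b)
          {
                  unsigned long sa = a->end - a->start, sb = b->end - b->start;

                  return (a->age * sa + b->age * sb) / (sa + sb);
          }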
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fda504fa
    • Colin Ian King's avatar
      mm/damon/core: nullify pointer ctx->kdamond with a NULL · 7ec1992b
      Colin Ian King authored
      
      
      Currently a plain integer is being used to nullify the pointer
      ctx->kdamond.  Use NULL instead.  This cleans up the following sparse
      warning:
      
        mm/damon/core.c:317:40: warning: Using plain integer as NULL pointer
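
      Schematically, the change is just::

          /* Before: sparse warns about a plain integer used as a pointer. */
          ctx->kdamond = 0;

          /* After: use NULL for pointer members. */
          ctx->kdamond = NULL;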
      
      Link: https://lkml.kernel.org/r/20210925215908.181226-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ec1992b
    • Changbin Du's avatar
      mm/damon: needn't hold kdamond_lock to print pid of kdamond · 42e4cef5
      Changbin Du authored
      
      
      Just get the pid via 'current->pid'.  Meanwhile, for symmetry, make the
      'starts' and 'finishes' logs both use the debug level.
      
      Link: https://lkml.kernel.org/r/20210927232432.17750-1-changbin.du@gmail.com
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42e4cef5
    • Changbin Du's avatar
      mm/damon: remove unnecessary do_exit() from kdamond · 5f7fe2b9
      Changbin Du authored
      
      
      Just return from the kthread function.
      
      Link: https://lkml.kernel.org/r/20210927232421.17694-1-changbin.du@gmail.com
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7fe2b9
    • SeongJae Park's avatar
      mm/damon/core: print kdamond start log in debug mode only · 704571f9
      SeongJae Park authored
      
      
      Logging of kdamond startup uses 'pr_info()' unnecessarily.  Make it use
      'pr_debug()' instead.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      704571f9
    • SeongJae Park's avatar
      include/linux/damon.h: fix kernel-doc comments for 'damon_callback' · d2f272b3
      SeongJae Park authored
      
      
      A few Kernel-doc comments in 'damon.h' are broken.  This fixes them.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2f272b3
    • SeongJae Park's avatar
      docs/vm/damon: remove broken reference · 876d0aac
      SeongJae Park authored
      
      
      Building the DAMON documents warns about a reference to a nonexistent
      document, as below:
      
          $ time make htmldocs
          [...]
          Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans'
      
      This fixes the warning by removing the wrong reference.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      876d0aac
    • SeongJae Park's avatar
      MAINTAINERS: update SeongJae's email address · f9803a99
      SeongJae Park authored
      
      
      This updates SeongJae's email address in MAINTAINERS file to his
      preferred one.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9803a99
    • SeongJae Park's avatar
      Documentation/vm: move user guides to admin-guide/mm/ · ad782c48
      SeongJae Park authored
      
      
      Most memory management user guide documents are in 'admin-guide/mm/',
      but two of them are in 'vm/'.  Move the two docs into 'admin-guide/mm'
      so that documents are easier to find.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad782c48
    • Geert Uytterhoeven's avatar
      mm/damon: grammar s/works/work/ · f24b0626
      Geert Uytterhoeven authored
      Correct a singular versus plural grammar mistake in the help text for
      the DAMON_VADDR config symbol.
      
      Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org
      Fixes: 3f49584b ("mm/damon: implement primitives for the virtual memory
      address spaces")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f24b0626
    • Marco Elver's avatar
      kfence: default to dynamic branch instead of static keys mode · 4f612ed3
      Marco Elver authored
      
      
      We have observed that on very large machines with newer CPUs, the
      static key/branch switching delay is on the order of milliseconds.
      This is due to the required broadcast IPIs, which simply do not scale
      well to hundreds of CPUs (cores).  If done too frequently, this can
      adversely affect tail latencies of various workloads.
      
      One workaround is to increase the sample interval to several seconds,
      while decreasing sampled allocation coverage, but the problem still
      exists and could still increase tail latencies.
      
      As already noted in the Kconfig help text, there are trade-offs: at
      lower sample intervals the dynamic branch results in better performance;
      however, at very large sample intervals, the static keys mode can result
      in better performance -- careful benchmarking is recommended.
      
      Our initial benchmarking showed that with large enough sample intervals
      and workloads stressing the allocator, the static keys mode was slightly
      better.  Evaluating and observing the possible system-wide side-effects
      of the static-key-switching induced broadcast IPIs, however, was a blind
      spot (in particular on large machines with 100s of cores).
      
      Therefore, a major downside of the static keys mode is, unfortunately,
      that it is hard to predict its performance on new system architectures
      and topologies, and likewise hard to draw conclusions about the
      performance of new workloads from a limited set of benchmarks.
      
      Most distributions will simply select the defaults, while targeting a
      large variety of different workloads and system architectures.  As such,
      the better default is CONFIG_KFENCE_STATIC_KEYS=n, and re-enabling it is
      only recommended after careful evaluation.
      
      For reference, on x86-64 the condition in kfence_alloc() generates
      exactly 2 instructions in the kmem_cache_alloc() fast-path:
      
       | ...
       | cmpl   $0x0,0x1a8021c(%rip)  # ffffffff82d560d0 <kfence_allocation_gate>
       | je     ffffffff812d6003      <kmem_cache_alloc+0x243>
       | ...
      
      which, given kfence_allocation_gate is infrequently modified, should be
      well predicted by most CPUs.
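
      In C terms, the CONFIG_KFENCE_STATIC_KEYS=n gate is conceptually just a
      load and compare; the sketch below illustrates that shape and is not
      the actual mm/kfence code::

          extern atomic_t kfence_allocation_gate;

          static inline bool kfence_gate_open_sketch(void)
          {
                  /*
                   * Dynamic branch: a plain read that the CPU predicts well,
                   * and toggling it needs no static-key broadcast IPIs.
                   */
                  return atomic_read(&kfence_allocation_gate) == 0;
          }

          /*
           * With CONFIG_KFENCE_STATIC_KEYS=y the same decision would instead
           * sit behind a static key (static_branch_unlikely()), removing the
           * load from the fast path but requiring IPIs on every toggle.
           */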
      
      Link: https://lkml.kernel.org/r/20211019102524.2807208-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f612ed3
    • Marco Elver's avatar
      kfence: always use static branches to guard kfence_alloc() · 07e8481d
      Marco Elver authored
      
      
      Regardless of KFENCE mode (CONFIG_KFENCE_STATIC_KEYS: either using
      static keys to gate allocations, or using a simple dynamic branch),
      always use a static branch to avoid the dynamic branch in kfence_alloc()
      if KFENCE was disabled at boot.
      
      For CONFIG_KFENCE_STATIC_KEYS=n, this now avoids the dynamic branch if
      KFENCE was disabled at boot.
      
      To simplify, this also unifies the location where kfence_allocation_gate
      is read-checked, placing the check inline in kfence_alloc().
      
      Link: https://lkml.kernel.org/r/20211019102524.2807208-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07e8481d
    • Marco Elver's avatar
      kfence: shorten critical sections of alloc/free · 49332956
      Marco Elver authored
      
      
      Initializing memory and setting/checking the canary bytes is relatively
      expensive, and doing so in the meta->lock critical sections extends the
      duration with preemption and interrupts disabled unnecessarily.
      
      Any reads of meta->addr and meta->size in kfence_guarded_alloc() and
      kfence_guarded_free() don't require locking meta->lock as long as the
      object is removed from the freelist: only kfence_guarded_alloc() sets
      meta->addr and meta->size after removing it from the freelist, which
      requires a preceding kfence_guarded_free() returning it to the list or
      the initial state.
      
      Therefore, move reads of meta->addr and meta->size, including the
      expensive memory initialization using them, out of the meta->lock
      critical sections.
      
      Link: https://lkml.kernel.org/r/20210930153706.2105471-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49332956
    • Marco Elver's avatar
      kfence: test: use kunit_skip() to skip tests · f51733e2
      Marco Elver authored
      
      
      Use the new kunit_skip() to skip tests if requirements were not met.  It
      makes it easier to see in KUnit's summary if there were skipped tests.
      
      Link: https://lkml.kernel.org/r/20210922182541.1372400-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: David Gow <davidgow@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f51733e2
    • Marco Elver's avatar
      kfence: add note to documentation about skipping covered allocations · 5cc906b4
      Marco Elver authored
      
      
      Add a note briefly mentioning the new policy of "skipping currently
      covered allocations if the pool is close to full".  Since this has a
      notable impact on KFENCE's bug-detection ability on systems with large
      uptimes, it is worth pointing out the feature.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-5-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cc906b4
    • Marco Elver's avatar
      kfence: limit currently covered allocations when pool nearly full · 08f6b106
      Marco Elver authored
      
      
      One of KFENCE's main design principles is that with increasing uptime,
      allocation coverage increases sufficiently to detect previously
      undetected bugs.
      
      We have observed that frequent long-lived allocations of the same
      source (e.g.  pagecache) tend to permanently fill up the KFENCE pool
      with increasing system uptime, thus breaking the above requirement.
      The workaround thus far had been increasing the sample interval and/or
      increasing the KFENCE pool size, but neither is a reliable solution.
      
      To ensure diverse coverage of allocations, limit currently covered
      allocations of the same source once pool utilization reaches 75%
      (configurable via `kfence.skip_covered_thresh`) or above.  The effect is
      retaining reasonable allocation coverage when the pool is close to full.
      
      A side-effect is that this also limits frequent long-lived allocations
      of the same source filling up the pool permanently.
      
      Uniqueness of an allocation for coverage purposes is based on its
      (partial) allocation stack trace (the source).  A Counting Bloom filter
      is used to check if an allocation is covered; if the allocation is
      currently covered, the allocation is skipped by KFENCE.
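
      A minimal sketch of the coverage check, assuming a simplified
      single-hash counting table (the real implementation may differ in table
      size, hashing, and seeding)::

          #include <linux/atomic.h>
          #include <linux/hash.h>
          #include <linux/jhash.h>

          #define COVERED_BITS_SKETCH  6  /* 64 counters; illustrative size */

          static atomic_t covered_sketch[1 << COVERED_BITS_SKETCH];

          /* Hash the (partial) allocation stack trace to a table index. */
          static u32 covered_index(unsigned long *stack, unsigned int nr)
          {
                  return hash_32(jhash(stack, nr * sizeof(stack[0]), 0),
                                 COVERED_BITS_SKETCH);
          }

          /* True if an allocation with this stack is currently covered. */
          static bool is_covered(unsigned long *stack, unsigned int nr)
          {
                  return atomic_read(&covered_sketch[covered_index(stack, nr)]) > 0;
          }

          /* Called with +1/-1 when KFENCE starts/stops tracking such a stack. */
          static void mark_covered(unsigned long *stack, unsigned int nr, int val)
          {
                  atomic_add(val, &covered_sketch[covered_index(stack, nr)]);
          }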
      
      Testing was done using:
      
      	(a) a synthetic workload that performs frequent long-lived
      	    allocations (default config values; sample_interval=1;
      	    num_objects=63), and
      
      	(b) normal desktop workloads on an otherwise idle machine where
      	    the problem was first reported after a few days of uptime
      	    (default config values).
      
      In both test cases the sampled allocation rate no longer drops to zero
      at any point.  In the case of (b) we observe (after 2 days uptime) 15%
      unique allocations in the pool, 77% pool utilization, with 20% "skipped
      allocations (covered)".
      
      [elver@google.com: simplify and just use hash_32(), use more random stack_hash_seed]
        Link: https://lkml.kernel.org/r/YU3MRGaCaJiYht5g@elver.google.com
      [elver@google.com: fix 32 bit]
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-4-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08f6b106
    • Marco Elver's avatar
      kfence: move saving stack trace of allocations into __kfence_alloc() · a9ab52bb
      Marco Elver authored
      
      
      Move the saving of the stack trace of allocations into __kfence_alloc(),
      so that the stack entries array can be used outside of
      kfence_guarded_alloc() and we avoid potentially unwinding the stack
      multiple times.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-3-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ab52bb
    • Marco Elver's avatar
      kfence: count unexpectedly skipped allocations · 9a19aeb5
      Marco Elver authored
      
      
      Maintain a counter of allocations that are skipped because they are
      incompatible (oversized, incompatible gfp flags) or because there is no
      capacity.
      
      This is to compute the fraction of allocations that could not be
      serviced by KFENCE, which we expect to be rare.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a19aeb5
    • Marco Elver's avatar
      stacktrace: move filter_irq_stacks() to kernel/stacktrace.c · f39f21b3
      Marco Elver authored
      
      
      filter_irq_stacks() has little to do with the stackdepot
      implementation, except that it is usually used by stackdepot users
      (such as KASAN) to trim the stack trace.
      
      However, filter_irq_stacks() itself is not useful without a stack trace
      as obtained by stack_trace_save() and friends.
      
      Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new
      users of filter_irq_stacks() do not have to start depending on
      STACKDEPOT only for filter_irq_stacks().
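
      Typical usage pairs it with stack_trace_save() and needs nothing from
      stackdepot; a minimal sketch::

          #include <linux/stacktrace.h>

          static unsigned int save_trace_sketch(unsigned long *entries,
                                                unsigned int max_entries)
          {
                  unsigned int nr = stack_trace_save(entries, max_entries, 0);

                  /* Drop the interrupted context's frames past the IRQ entry. */
                  return filter_irq_stacks(entries, nr);
          }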
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f39f21b3
    • Mianhan Liu's avatar
      include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h · a1554c00
      Mianhan Liu authored
      
      
      nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
      The advantage of this change is that it can reduce obsolete includes.
      For example, net/ipv4/tcp.c wouldn't need swap.h any more since it
      already includes mm.h.  Similarly, after checking all the other files,
      it turns out that tcp.c, udp.c, meter.c, ...  follow the same rule, so
      these files can have swap.h removed too.

      Moreover, after preprocessing all the files that use
      nr_free_buffer_pages, it turns out that those files have already
      included mm.h.  Thus, we can move nr_free_buffer_pages from swap.h to
      mm.h safely.  This change will not affect the compilation of other
      files.
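
      Schematically, a caller now only needs mm.h for this symbol (the
      function and sizing factor below are illustrative, not taken from the
      patch)::

          #include <linux/mm.h>   /* provides nr_free_buffer_pages() */
          /* <linux/swap.h> is no longer needed just for this symbol. */

          static unsigned long example_buffer_budget(void)
          {
                  /* Size some buffer pool from the freeable low-memory pages. */
                  return nr_free_buffer_pages() / 16;
          }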
      
      Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
      Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: "David S . Miller" <davem@davemloft.net>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1554c00
    • Stephen Kitt's avatar
      mm: remove HARDENED_USERCOPY_FALLBACK · 53944f17
      Stephen Kitt authored
      This has served its purpose and is no longer used.  All usercopy
      violations appear to have been handled by now, any remaining instances
      (or new bugs) will cause copies to be rejected.
      
      This isn't a direct revert of commit 2d891fbc ("usercopy: Allow strict
      enforcement of whitelists"); since usercopy_fallback is effectively 0,
      the fallback handling is removed too.
      
      This also removes the usercopy_fallback module parameter on slab_common.
      
      Link: https://github.com/KSPP/linux/issues/153
      Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org
      Signed-off-by: Stephen Kitt <steve@sk2.org>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Joel Stanley <joel@jms.id.au>	[defconfig change]
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E . Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53944f17
    • Brian Geffon's avatar
      zram: introduce an aged idle interface · 755804d1
      Brian Geffon authored
      
      
      This change introduces an aged idle interface to the existing idle sysfs
      file for zram.
      
      When CONFIG_ZRAM_MEMORY_TRACKING is enabled, the idle file now also
      accepts an integer argument.  This integer is the age (in seconds) of
      pages to mark as idle.  The idle file still supports 'all' as it always
      has.  This new approach allows for much more control over which pages
      get marked as idle.
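
      For example, marking pages that have not been accessed for an hour as
      idle could look like the userspace sketch below (the zram0 device name
      is an assumption, and CONFIG_ZRAM_MEMORY_TRACKING=y is required)::

          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          int main(void)
          {
                  const char *path = "/sys/block/zram0/idle";
                  const char *age = "3600";       /* pages idle for >= 3600 seconds */
                  int fd = open(path, O_WRONLY);

                  if (fd < 0 || write(fd, age, strlen(age)) < 0)
                          perror(path);
                  if (fd >= 0)
                          close(fd);
                  return 0;
          }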
      
      [bgeffon@google.com: use IS_ENABLED and cleanup comment]
        Link: https://lkml.kernel.org/r/20210924161128.1508015-1-bgeffon@google.com
      [bgeffon@google.com: Sergey's cleanup suggestions]
        Link: https://lkml.kernel.org/r/20210929143056.13067-1-bgeffon@google.com
      
      Link: https://lkml.kernel.org/r/20210923130115.1344361-1-bgeffon@google.com
      Signed-off-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Jesse Barnes <jsbarnes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      755804d1
    • Dan Carpenter's avatar
      zram: off by one in read_block_state() · a88e03cf
      Dan Carpenter authored
      snprintf() returns the number of bytes it would have printed if there
      were space.  But it does not count the NUL terminator.  So that means
      that if "count == copied" then this has already overflowed by one
      character.
      
      This bug likely isn't super harmful in real life.
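
      Schematically, the accounting has to treat a return value equal to the
      remaining space as truncation as well (a sketch of the pattern, not the
      zram code)::

          #include <stddef.h>
          #include <stdio.h>      /* in-kernel code gets snprintf() elsewhere */

          static size_t append_line(char *buf, size_t count, size_t written,
                                    const char *msg)
          {
                  int copied = snprintf(buf + written, count - written, "%s\n", msg);

                  /*
                   * snprintf() returns the length it would have written, not
                   * counting the NUL byte, so copied == count - written already
                   * means the output was truncated by one character.
                   */
                  if (copied < 0 || (size_t)copied >= count - written)
                          return written;         /* full: discard the partial line */
                  return written + copied;
          }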
      
      Link: https://lkml.kernel.org/r/20210916130404.GA25094@kili
      Fixes: c0265342 ("zram: introduce zram memory tracking")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a88e03cf
    • Jaewon Kim's avatar
      zram_drv: allow reclaim on bio_alloc · 4aabdc14
      Jaewon Kim authored
      
      
      read_from_bdev_async() is not called in atomic context, so GFP_NOIO can
      be used rather than GFP_ATOMIC.  If there are pages reclaimable under
      GFP_NOIO, we can avoid allocation failures and the resulting page fault
      failures.
      
      Link: https://lkml.kernel.org/r/20210908005241.28062-1-jaewon31.kim@samsung.com
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aabdc14
    • Ira Weiny's avatar
      mm/highmem: remove deprecated kmap_atomic · d2c20e51
      Ira Weiny authored
      
      
      kmap_atomic() is being deprecated in favor of kmap_local_page().
      
      Replace the uses of kmap_atomic() within the highmem code.
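
      The conversion pattern is mechanical; a minimal sketch (an illustrative
      helper, not one of the call sites actually converted here)::

          #include <linux/highmem.h>
          #include <linux/string.h>

          static void zero_page_sketch(struct page *page)
          {
                  /* Old style: disables pagefaults/preemption as a side effect. */
                  void *addr = kmap_atomic(page);

                  memset(addr, 0, PAGE_SIZE);
                  kunmap_atomic(addr);

                  /* New style: the mapping is CPU-local but preemptible. */
                  addr = kmap_local_page(page);
                  memset(addr, 0, PAGE_SIZE);
                  kunmap_local(addr);
          }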
      
      On profiling clear_huge_page() using ftrace, an improvement of 62% was
      observed with the setup below.

      Setup:-
      The data below was collected on Qualcomm's SM7250 SoC with THP enabled
      (kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
      switched on and set to max frequency, and DDR set to the perf governor.
      
      FTRACE Data:-
      
      Base data:-
      Number of iterations: 48
      Mean of allocation time: 349.5 us
      std deviation: 74.5 us
      
      v4 data:-
      Number of iterations: 48
      Mean of allocation time: 131 us
      std deviation: 32.7 us
      
      The following simple userspace experiment, allocating 100 MB (BUF_SZ)
      of pages and writing to them, gave us good insight: we observed an
      improvement of 42% in allocation and write timings.
      -------------------------------------------------------------
      Test code snippet
      -------------------------------------------------------------
            clock_start();                     /* timing helper of the test harness */
            buf = malloc(BUF_SZ);              /* allocate 100 MB of memory */

            for (i = 0; i < BUF_SZ_PAGES; i++) /* touch one int per page */
                    *((int *)(buf + (i * PAGE_SIZE))) = 1;

            clock_end();                       /* timing helper of the test harness */
      -------------------------------------------------------------
      
      Malloc test timings for 100MB anon allocation:-
      
      Base data:-
      Number of iterations: 100
      Mean of allocation time: 31831 us
      std deviation: 4286 us
      
      v4 data:-
      Number of iterations: 100
      Mean of allocation time: 18193 us
      std deviation: 4915 us
      
      [willy@infradead.org: fix zero_user_segments()]
        Link: https://lkml.kernel.org/r/YYVhHCJcm2DM2G9u@casper.infradead.org
      
      Link: https://lkml.kernel.org/r/20210204073255.20769-2-prathu.baronia@oneplus.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2c20e51
    • Miaohe Lin's avatar
      mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() · afe8605c
      Miaohe Lin authored
      There is a possible race window between zs_pool_dec_isolated() and
      zs_unregister_migration() because wait_for_isolated_drain() checks the
      isolated count without holding class->lock and there is no ordering
      inside zs_pool_dec_isolated().  Thus the race window below is possible:
      
        zs_pool_dec_isolated		zs_unregister_migration
          check pool->destroying != 0
      				  pool->destroying = true;
      				  smp_mb();
      				  wait_for_isolated_drain()
      				    wait for pool->isolated_pages == 0
          atomic_long_dec(&pool->isolated_pages);
          atomic_long_read(&pool->isolated_pages) == 0
      
      Since pool->destroying is observed as false before the atomic_long_dec()
      of pool->isolated_pages, the wakeup of pool->migration_wait is missed.

      Fix this by ensuring that the check of pool->destroying happens after
      the atomic_long_dec(&pool->isolated_pages).
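
      Schematically, the fixed ordering looks like the sketch below
      (abbreviated field and function names; an illustration of the ordering,
      not the exact patch)::

          #include <linux/atomic.h>
          #include <linux/wait.h>

          struct zs_pool_sketch {
                  atomic_long_t isolated_pages;
                  bool destroying;
                  wait_queue_head_t migration_wait;
          };

          static void dec_isolated_sketch(struct zs_pool_sketch *pool)
          {
                  atomic_long_dec(&pool->isolated_pages);
                  /* Order the decrement before the reads below. */
                  smp_mb__after_atomic();

                  /*
                   * Checking destroying only after the decrement pairs with
                   * zs_unregister_migration() setting destroying and then
                   * waiting for isolated_pages to drain, so the wakeup of
                   * migration_wait can no longer be missed.
                   */
                  if (pool->destroying &&
                      atomic_long_read(&pool->isolated_pages) == 0)
                          wake_up_all(&pool->migration_wait);
          }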
      
      Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com
      Fixes: 701d6785 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Henry Burns <henryburns@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afe8605c
    • Alistair Popple's avatar
      mm/rmap.c: avoid double faults migrating device private pages · 3d88705c
      Alistair Popple authored
      
      
      During migration, special page table entries are installed for each
      page being migrated.  These entries store the pfn and associated
      permissions of the ptes mapping the page being migrated.
      
      Device-private pages use special swap pte entries to distinguish
      read-only vs.  writeable pages which the migration code checks when
      creating migration entries.  Normally this follows a fast path in
      migrate_vma_collect_pmd() which correctly copies the permissions of
      device-private pages over to migration entries when migrating pages back
      to the CPU.
      
      However the slow-path falls back to using try_to_migrate() which
      unconditionally creates read-only migration entries for device-private
      pages.  This leads to unnecessary double faults on the CPU as the new
      pages are always mapped read-only even when they could be mapped
      writeable.  Fix this by correctly copying device-private permissions in
      try_to_migrate_one().
      
      Link: https://lkml.kernel.org/r/20211018045247.3128058-1-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reported-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d88705c
    • David Hildenbrand's avatar
      mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED · 32befe9e
      David Hildenbrand authored
      
      
      Let's communicate driver-managed regions to memblock, to properly teach
      kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on these
      memory regions.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32befe9e
    • David Hildenbrand's avatar
      memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED · f7892d8e
      David Hildenbrand authored
      
      
      Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
      indicating that we're dealing with a memory region that is never
      indicated in the firmware-provided memory map, but always detected and
      added by a driver.
      
      Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
      memory regions like ordinary MEMBLOCK_NONE memory regions -- for
      example, when selecting memory regions to add to the vmcore for dumping
      in the crashkernel via for_each_mem_range().
      
      However, especially kexec_file is not supposed to select such memblocks
      via for_each_free_mem_range() / for_each_free_mem_range_reverse() to
      place kexec images, similar to how we handle
      IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK.
      
      We'll make sure that memory hotplug code sets the flag where applicable
      (IORESOURCE_SYSRAM_DRIVER_MANAGED) next.  This prepares architectures
      that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
      support.
      
      Note that kexec *must not* indicate this memory to the second kernel and
      *must not* place kexec-images on this memory.  Let's add a comment to
      kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
      now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
      locate_mem_hole_callback() for kexec_walk_resources().
      
      Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
      semantics:
      	MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
      	firmware-provided memory map and added to the system early during
      	boot; kexec *has to* indicate this memory to the second kernel and
      	can place kexec-images on this memory. After memory hotunplug,
      	kexec has to be re-armed. We mostly ignore this flag when
      	"movable_node" is not set on the kernel command line, because
      	then we're told to not care about hotunpluggability of such
      	memory regions.
      
      	MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
      	the firmware-provided memory map; this memory is always detected
      	and added to the system by a driver; memory might not actually be
      	physically hotunpluggable. kexec *must not* indicate this memory to
      	the second kernel and *must not* place kexec-images on this memory.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7892d8e
    • David Hildenbrand's avatar
      memblock: allow to specify flags with memblock_add_node() · 952eea9b
      David Hildenbrand authored
      
      
      We want to specify flags when hotplugging memory.  Let's prepare to pass
      flags to memblock_add_node() by adjusting all existing users.
      
      Note that when hotplugging memory the system is already up and running
      and we might have concurrent memblock users: for example, while we're
      hotplugging memory, kexec_file code might search for suitable memory
      regions to place kexec images.  It's important to add the memory
      directly to memblock via a single call with the right flags, instead of
      adding the memory first and applying the flags later: otherwise,
      concurrent memblock users might temporarily stumble over memblocks with
      the wrong flags, which will be important in a follow-up patch that
      introduces a new flag to properly handle add_memory_driver_managed().
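
      After the change, a hotplug-path caller can pass its flags in the same
      call that adds the range; a sketch (the driver-managed flag itself only
      arrives in the follow-up patch)::

          #include <linux/memblock.h>

          static int add_range_sketch(phys_addr_t base, phys_addr_t size, int nid)
          {
                  /*
                   * The flags are applied in the same call that adds the range,
                   * so concurrent memblock walkers never see it without them.
                   */
                  return memblock_add_node(base, size, nid, MEMBLOCK_NONE);
          }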
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shahab Vahedi <shahab@synopsys.com>	[arch/arc]
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      952eea9b
    • David Hildenbrand's avatar
      memblock: improve MEMBLOCK_HOTPLUG documentation · e14b4155
      David Hildenbrand authored
      
      
      The description of MEMBLOCK_HOTPLUG is currently short and consequently
      misleading: we're actually dealing with a memory region that might get
      hotunplugged later (i.e., the platform+firmware supports it), yet it is
      indicated in the firmware-provided memory map as system RAM that will
      just get used by the system for any purpose when not taking special
      care.  The firmware marked this memory region as hot(un)pluggable
      (e.g., hotplugged before the reboot), implying that it might get
      hotunplugged again later.
      
      Whether we consider this information depends on the "movable_node"
      kernel command line parameter: only with "movable_node" set do we try
      to keep this memory hotunpluggable, for example, by not serving early
      allocations from this memory region and by letting the buddy allocator
      manage it via ZONE_MOVABLE.
      
      Let's make this clearer by extending the documentation.
      
      Note: kexec *has to* indicate this memory to the second kernel.  With
      "movable_node" set, we don't want to place kexec-images on this memory.
      Without "movable_node" set, we don't care and can place kexec-images on
      this memory.  In both cases, after successful memory hotunplug, kexec
      has to be re-armed to update the memory map for the second kernel and to
      place the kexec-images somewhere else.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e14b4155
    • David Hildenbrand's avatar
      mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() · 53d38316
      David Hildenbrand authored
      Patch series "mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK", v2.
      
      Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
      don't cleanly support add_memory_driver_managed() yet.  Most
      prominently, kexec_file can still end up placing kexec images on such
      driver-managed memory, resulting in undesired behavior, for example,
      having kexec images located on memory not part of the firmware-provided
      memory map.
      
      Teaching kexec to not place images on driver-managed memory is
      especially relevant for virtio-mem.  Details can be found in commit
      7b7b2721 ("mm/memory_hotplug: introduce add_memory_driver_managed()").
      
      Extend memblock with a new flag and set it from memory hotplug code when
      applicable.  This is required to fully support virtio-mem on arm64,
      making also kexec_file behave like on x86-64.
      
      This patch (of 2):
      
      If memblock_add_node() fails, we're most probably running out of memory.
      While this is unlikely to happen, it can happen and having memory added
      without a memblock can be problematic for architectures that use
      memblock to detect valid memory.  Let's fail in a nice way instead of
      silently ignoring the error.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20211004093605.5830-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53d38316
    • David Hildenbrand's avatar
      x86: remove memory hotplug support on X86_32 · 5c11f00b
      David Hildenbrand authored
      
      
      CONFIG_MEMORY_HOTPLUG was marked BROKEN for over one year and we just
      restricted it to 64 bit.  Let's remove the unused x86 32 bit
      implementation and simplify the Kconfig.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c11f00b
    • David Hildenbrand's avatar
      mm/memory_hotplug: remove stale function declarations · 43e3aa2a
      David Hildenbrand authored
      
      
      These functions no longer exist.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43e3aa2a
    • mm/memory_hotplug: remove HIGHMEM leftovers · 6b740c6c
      David Hildenbrand authored
      
      
      We don't support CONFIG_MEMORY_HOTPLUG on 32 bit, and consequently we
      don't support HIGHMEM there either.  Let's remove any leftover code --
      including the unused "status_change_nid_high" field of the memory
      notifier.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b740c6c
    • mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit · 7ec58a2b
      David Hildenbrand authored
      32 bit support is broken in various ways: for example, we can online
      memory that should actually go to ZONE_HIGHMEM into ZONE_MOVABLE, or in
      some cases even into one of the other kernel zones.
      
      We marked it BROKEN in commit b59d02ed ("mm/memory_hotplug: disable the
      functionality for 32b") almost one year ago.  According to that commit,
      it might have been broken since at least 2017.  Further, there is hardly
      a sane use case nowadays.
      
      Let's just depend entirely on 64BIT, dropping the "BROKEN" dependency to
      make clear that we are not going to support it again.  Next, we'll remove
      some HIGHMEM leftovers from the memory hotplug code to clean it up.
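
      For illustration, the Kconfig change described above boils down to
      something like this sketch of the MEMORY_HOTPLUG entry (not the literal
      diff; surrounding lines and other dependencies are omitted):

      	config MEMORY_HOTPLUG
      		bool "Allow for memory hot-add"
      		# sketch: was "depends on 64BIT || BROKEN"
      		depends on 64BIT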
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ec58a2b
    • mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE · 50f9481e
      David Hildenbrand authored
      
      
      CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
      CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
      CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
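
      For context, the now-redundant symbol was defined along these lines (a
      sketch from memory; the exact definition may differ), which makes it
      trivially equivalent to CONFIG_MEMORY_HOTPLUG once the latter already
      requires SPARSEMEM:

      	config MEMORY_HOTPLUG_SPARSE
      		def_bool y
      		depends on SPARSEMEM && MEMORY_HOTPLUG

      With that, every #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE can simply become
      #ifdef CONFIG_MEMORY_HOTPLUG.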
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shuah Khan <skhan@linuxfoundation.org>	[kselftest]
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50f9481e
    • mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG · 71b6f2dd
      David Hildenbrand authored
      
      
      Patch series "mm/memory_hotplug: Kconfig and 32 bit cleanups".
      
      Some cleanups around CONFIG_MEMORY_HOTPLUG, including removing 32 bit
      leftovers of memory hotplug support.
      
      This patch (of 6):
      
      SPARSEMEM is the only possible memory model for x86-64; FLATMEM is not
      possible:
      
      	config ARCH_FLATMEM_ENABLE
      		def_bool y
      		depends on X86_32 && !NUMA
      
      And X86_64_ACPI_NUMA (obviously) only supports x86-64:
      
      	config X86_64_ACPI_NUMA
      		def_bool y
      		depends on X86_64 && NUMA && ACPI && PCI
      
      Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it no longer
      makes sense.
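
      Concretely, the resulting simplification is of this shape (an illustrative
      sketch only; the real MEMORY_HOTPLUG entry carries additional dependencies
      not shown here):

      	config MEMORY_HOTPLUG
      		bool "Allow for memory hot-add"
      		# sketch: was "depends on SPARSEMEM || X86_64_ACPI_NUMA"
      		depends on SPARSEMEM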
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71b6f2dd
    • memory-hotplug.rst: document the "auto-movable" online policy · 9e122cc1
      David Hildenbrand authored
      Commit e83a437f ("mm/memory_hotplug: introduce "auto-movable" online
      policy") introduced a new memory online policy to automatically select a
      zone for memory blocks to be onlined.  It added a way to set the active
      online policy and tunables for the auto-movable online policy.
      
      Follow-up commits tweaked the "auto-movable" policy to also consider
      memory device details when selecting zones for memory blocks to be
      onlined.
      
      Let's document the new toggles and describe how the two online policies
      we now have work.
      
      [david@redhat.com: updates]
        Link: https://lkml.kernel.org/r/20211011082058.6076-4-david@redhat.com
      
      Link: https://lkml.kernel.org/r/20210930144117.23641-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e122cc1