  1. Apr 08, 2020
    • include/linux/memremap.h: remove stale comments · 1d90b649
      Ira Weiny authored
      Fixes: 80a72d0a ("memremap: remove the data field in struct dev_pagemap")
      Fixes: fdc029b1 ("memremap: remove the dev field in struct dev_pagemap")
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200316213205.145333-1-ira.weiny@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d90b649
    • include/linux/swapops.h: correct guards for non_swap_entry() · 3f3673d7
      Steven Price authored
      If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor
      CONFIG_MIGRATION, then non_swap_entry() will return 0, meaning that the
      condition (non_swap_entry(entry) && is_device_private_entry(entry)) in
      zap_pte_range() will never be true even if the entry is a device private
      one.
      
      Equally any other code depending on non_swap_entry() will not function as
      expected.
      
      I originally spotted this just by looking at the code; I haven't actually
      observed any problems.
      
      Looking a bit more closely it appears that actually this situation
      (currently at least) cannot occur:
      
      DEVICE_PRIVATE depends on ZONE_DEVICE
      ZONE_DEVICE depends on MEMORY_HOTREMOVE
      MEMORY_HOTREMOVE depends on MIGRATION
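
      To illustrate the guard problem, a minimal sketch (simplified, not the
      verbatim swapops.h code): with only CONFIG_DEVICE_PRIVATE enabled, the
      stub variant is picked and device private entries are never reported:

        #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
        static inline int non_swap_entry(swp_entry_t entry)
        {
                return swp_type(entry) >= MAX_SWAPFILES;
        }
        #else
        static inline int non_swap_entry(swp_entry_t entry)
        {
                return 0;   /* device private entries look like ordinary swap entries */
        }
        #endif

      The fix corrects the guards so that the real check is also used when
      CONFIG_DEVICE_PRIVATE is enabled.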
      
      Fixes: 5042db43 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory")
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f3673d7
    • mm: use fallthrough; · e4a9bc58
      Joe Perches authored
      
      
      Convert the various /* fallthrough */ comments to the pseudo-keyword
      fallthrough;
      
      Done via script:
      https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
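
      An illustrative before/after of the conversion (made-up switch statement,
      not a hunk from the patch):

        switch (cmd) {
        case CMD_PREPARE:
                setup();
                /* fallthrough */       /* old: plain comment */
        case CMD_RUN:
                run();
                break;
        }

        switch (cmd) {
        case CMD_PREPARE:
                setup();
                fallthrough;            /* new: pseudo-keyword the compiler can check */
        case CMD_RUN:
                run();
                break;
        }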
      
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4a9bc58
    • mm/mm_init.c: clean code. Use BUILD_BUG_ON when comparing compile time constant · e46b893d
      Mateusz Nosek authored
      
      
      MAX_ZONELISTS is a compile time constant, so it should be compared using
      BUILD_BUG_ON not BUG_ON.
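
      For illustration (the exact condition is assumed, not quoted from the
      patch), the same check moves from run time to build time:

        BUG_ON(MAX_ZONELISTS > 2);        /* evaluated at runtime on every boot */
        BUILD_BUG_ON(MAX_ZONELISTS > 2);  /* breaks the build if the constant condition is true */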
      
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200228224617.11343-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e46b893d
    • mm: fix ambiguous comments for better code readability · 552657b7
      chenqiwu authored
      
      
      The @pfn parameter of remap_pfn_range() passed from the caller is actually
      a page-frame number converted from the corresponding physical address of
      kernel memory; the original comment is ambiguous and may mislead users.

      Meanwhile, there is an ambiguous typo "VMM" in the comment of
      vm_area_struct.  Fixing both makes the code more readable.
      
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      552657b7
    • mm/zsmalloc: add missing annotation for unpin_tag() · bc22b18b
      Jules Irenge authored
      
      
      Sparse reports a warning at unpin_tag()
      
      warning: context imbalance in unpin_tag() - unexpected unlock
      
      The root cause is the missing annotation at unpin_tag()
      Add the missing __releases(bitlock) annotation
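
      A minimal sketch of the annotation pattern (simplified, details assumed
      rather than quoted from mm/zsmalloc.c):

        static void unpin_tag(unsigned long handle) __releases(bitlock)
        {
                bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }

      The matching pin_tag() side carries __acquires(bitlock), so sparse sees a
      balanced lock context across the pair.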
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-14-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc22b18b
    • mm/zsmalloc: add missing annotation for pin_tag() · 70c7ec95
      Jules Irenge authored
      
      
      Sparse reports a warning at pin_tag()
      
      warning: context imbalance in pin_tag() - wrong count at exit
      
      The root cause is the missing annotation at pin_tag()
      Add the missing __acquires(bitlock) annotation
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-13-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70c7ec95
    • mm/zsmalloc: add missing annotation for migrate_read_unlock() · 8a374ccc
      Jules Irenge authored
      
      
      Sparse reports a warning at migrate_read_unlock()
      
       warning: context imbalance in migrate_read_unlock() - unexpected unlock
      
      The root cause is the missing annotation at migrate_read_unlock()
      Add the missing __releases(&zspage->lock) annotation
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-12-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a374ccc
    • mm/zsmalloc: add missing annotation for migrate_read_lock() · cfc451cf
      Jules Irenge authored
      
      
      Sparse reports a warning at migrate_read_lock()
      
       warning: context imbalance in migrate_read_lock() - wrong count at exit
      
      The root cause is the missing annotation at migrate_read_lock()
      Add the missing __acquires(&zspage->lock) annotation
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-11-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfc451cf
    • mm/slub: add missing annotation for put_map() · 81aba9e0
      Jules Irenge authored
      
      
      Sparse reports a warning at put_map()
      
       warning: context imbalance in put_map() - unexpected unlock
      
      The root cause is the missing annotation at put_map()
      Add the missing __releases(&object_map_lock) annotation
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-10-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81aba9e0
    • mm/slub: add missing annotation for get_map() · 31364c2e
      Jules Irenge authored
      
      
      Sparse reports a warning at get_map()
      
       warning: context imbalance in get_map() - wrong count at exit
      
      The root cause is the missing annotation at get_map()
      Add the missing __acquires(&object_map_lock) annotation
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-9-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31364c2e
    • mm/mempolicy: add missing annotation for queue_pages_pmd() · 959a7e13
      Jules Irenge authored
      
      
      Sparse reports a warning at queue_pages_pmd()
      
      context imbalance in queue_pages_pmd() - unexpected unlock
      
      The root cause is the missing annotation at queue_pages_pmd()
      Add the missing __releases(ptl)
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      959a7e13
    • mm/hugetlb: add missing annotation for gather_surplus_pages() · 1b2a1e7b
      Jules Irenge authored
      
      
      Sparse reports a warning at gather_surplus_pages()
      
      warning: context imbalance in hugetlb_cow() - unexpected unlock
      
      The root cause is the missing annotation at gather_surplus_pages()
      Add the missing __must_hold(&hugetlb_lock)
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Link: http://lkml.kernel.org/r/20200214204741.94112-7-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b2a1e7b
    • mm/compaction: add missing annotation for compact_lock_irqsave · 77337ede
      Jules Irenge authored
      
      
      Sparse reports a warning at compact_lock_irqsave()
      
      warning: context imbalance in compact_lock_irqsave() - wrong count at exit
      
      The root cause is the missing annotation at compact_lock_irqsave()
      Add the missing __acquires(lock) annotation.
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77337ede
    • mm/zswap: allow setting default status, compressor and allocator in Kconfig · bb8b93b5
      Maciej S. Szmigiero authored
      The compressed cache for swap pages (zswap) currently needs one to three
      extra kernel command line parameters to work: it has to be enabled by
      adding a "zswap.enabled=1" parameter, and if one wants a compressor or
      pool allocator other than the default lzo / zbud combination, these
      choices also have to be specified as additional kernel command line
      parameters.
      
      Using a different compressor and allocator for zswap is actually pretty
      common as guides often recommend using the lz4 / z3fold pair instead of
      the default one.  In such case it is also necessary to remember to enable
      the appropriate compression algorithm and pool allocator in the kernel
      config manually.
      
      Let's avoid the need for adding these kernel command line parameters and
      automatically pull in the dependencies for the selected compressor
      algorithm and pool allocator by adding appropriate default switches to
      Kconfig.
      
      The def...
      bb8b93b5
    • mm: prevent a warning when casting void* -> enum · 4708f318
      Palmer Dabbelt authored
      
      
      I recently built the RISC-V port with LLVM trunk, which has introduced a
      new warning when casting from a pointer to an enum of a smaller size.
      This patch simply casts to a long in the middle to stop the warning.  I'd
      be surprised if this is the only one in the kernel, but it's the only one
      I saw.
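
      Illustrative only (hypothetical names, not the actual hunk): casting the
      pointer through long first keeps the sizes consistent and silences the
      warning:

        void *arg = data;                        /* hypothetical void * source */
        enum node_stat_item item;

        item = (enum node_stat_item)arg;         /* LLVM warns: pointer cast to smaller enum */
        item = (enum node_stat_item)(long)arg;   /* cast through long: no warning */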
      
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4708f318
    • mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Hugh Dickins authored
      
      
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71725ed1
    • mm/shmem.c: clean code by removing unnecessary assignment · 343c3d7f
      Mateusz Nosek authored
      
      
      Previously 0 was assigned to the variable 'error', but the variable was
      never read before being reassigned later, so the assignment can be removed.
      
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      343c3d7f
    • mm/shmem.c: distribute switch variables for initialization · 27d80fa2
      Kees Cook authored
      
      
      Variables declared in a switch statement before any case statements cannot
      be automatically initialized with compiler instrumentation (as they are
      not part of any execution flow).  With GCC's proposed automatic stack
      variable initialization feature, this triggers a warning (and they don't
      get initialized).  Clang's automatic stack variable initialization (via
      CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
      initialize such variables[1].  Note that these warnings (or silent
      skipping) happen before the dead-store elimination optimization phase, so
      even when the automatic initializations are later elided in favor of
      direct initializations, the warnings remain.
      
      To avoid these problems, move such variables into the "case" where they're
      used or lift them up into the main function body.
      
      mm/shmem.c: In function `shmem_getpage_gfp':
      mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
       1816 |   loff_t i_size;
            |          ^~~~~~
      
      [1] https://bugs.llvm.org/show_bug.cgi?id=44916
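
      A sketch of the kind of change (names taken from the warning above, the
      surrounding code is assumed):

        /* before: declared before any case label, so the declaration is never "executed" */
        switch (sgp) {
                loff_t i_size;
        case SGP_READ:
                i_size = i_size_read(inode);
                break;
        }

        /* after: declared inside the case that uses it */
        switch (sgp) {
        case SGP_READ: {
                loff_t i_size = i_size_read(inode);
                break;
        }
        }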
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27d80fa2
    • mm/memory_hotplug.c: use __pfn_to_section() instead of open-coding · 10404901
      chenqiwu authored
      
      
      Use __pfn_to_section() API instead of open-coding for better code
      readability.
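
      The shape of the change (illustrative; the actual call site is not quoted
      here):

        ms = __nr_to_section(pfn_to_section_nr(pfn));   /* open-coded */
        ms = __pfn_to_section(pfn);                     /* helper, same result */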
      
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/1584345134-16671-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10404901
    • mm/memory_hotplug: allow to specify a default online_type · 5f47adf7
      David Hildenbrand authored
      For now, distributions implement advanced udev rules to essentially
      - Don't online any hotplugged memory (s390x)
      - Online all memory to ZONE_NORMAL (e.g., most virt environments like
        hyperv)
      - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
        care of (e.g., bare metal, special virt environments)
      
      In summary: All memory is usually onlined the same way, however, the
      kernel always has to ask user space to come up with the same answer.
      E.g., Hyper-V always waits for a memory block to get onlined before
      continuing, otherwise it might end up adding memory faster than
      onlining it, which can result in strange OOM situations.  This waiting
      slows down adding of a bigger amount of memory.
      
      Let's allow to specify a default online_type, not just "online" and
      "offline".  This allows distributions to configure the default online_type
      when booting up and be done with it.
      
      We can now specify "offline", "online", "online_movable" and
      "online_...
      5f47adf7
    • mm/memory_hotplug: convert memhp_auto_online to store an online_type · 862919e5
      David Hildenbrand authored
      
      
      ...  and rename it to memhp_default_online_type.  This is a preparation
      for more detailed default online behavior.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      862919e5
    • mm/memory_hotplug: unexport memhp_auto_online · 5a04af13
      David Hildenbrand authored
      
      
      All in-tree users except the mm-core are gone. Let's drop the export.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a04af13
    • hv_balloon: don't check for memhp_auto_online manually · bc58ebd5
      David Hildenbrand authored
      We get the MEM_ONLINE notifier call if memory is added right from the
      kernel via add_memory() or later from user space.
      
      Let's get rid of the "ha_waiting" flag - the wait event has an inbuilt
      mechanism (->done) for that.  Initialize the wait event only once and
      reinitialize before adding memory.  Unconditionally call complete() and
      wait_for_completion_timeout().
      
      If there are no waiters, complete() will only increment ->done - which
      will be reset by reinit_completion().  If complete() has already been
      called, wait_for_completion_timeout() will not wait.
      
      There is still the chance for a small race between concurrent
      reinit_completion() and complete().  If complete() wins, we would not wait
      - which is tolerable (and the race exists in current code as well).
      
      Note: We only wait for "some" memory to get onlined, which seems to be
            good enough for now.
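
      A sketch of the completion pattern described above (field and function
      names assumed, not quoted from the driver):

        init_completion(&dm_device.ol_waitevent);          /* once, at init time */

        /* before requesting another block of memory */
        reinit_completion(&dm_device.ol_waitevent);
        add_memory(nid, start, size);
        wait_for_completion_timeout(&dm_device.ol_waitevent, 5 * HZ);

        /* in the MEM_ONLINE memory notifier */
        complete(&dm_device.ol_waitevent);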
      
      [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David]
      Signed-o...
      bc58ebd5
    • powernv/memtrace: always online added memory blocks · ed7f9fec
      David Hildenbrand authored
      
      
      Let's always try to online the re-added memory blocks.  In case
      add_memory() already onlined the added memory blocks, the first
      device_online() call will fail and stop processing the remaining memory
      blocks.
      
      This avoids manually having to check memhp_auto_online.
      
      Note: PPC always onlines all hotplugged memory directly from the kernel as
      well - something that is handled by user space on other architectures.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed7f9fec
    • drivers/base/memory: store mapping between MMOP_* and string in an array · 4dc8207b
      David Hildenbrand authored
      
      
      Let's use a simple array which we can reuse soon.  While at it, move the
      string->mmop conversion out of the device hotplug lock.
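
      Roughly the idea (array and enum spellings assumed, not necessarily
      verbatim):

        static const char * const online_type_to_str[] = {
                [MMOP_OFFLINE] = "offline",
                [MMOP_ONLINE] = "online",
                [MMOP_ONLINE_KERNEL] = "online_kernel",
                [MMOP_ONLINE_MOVABLE] = "online_movable",
        };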
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dc8207b
    • drivers/base/memory: map MMOP_OFFLINE to 0 · efc978ad
      David Hildenbrand authored
      
      
      Historically, we used the value -1.  Just treat 0 as the special case now.
      Clarify a comment (which was wrong, when we come via device_online() the
      first time, the online_type would have been 0 / MEM_ONLINE).  The default
      is now always MMOP_OFFLINE.  This removes the last user of the manual
      "-1", which didn't use the enum value.
      
      This is a preparation to use the online_type as an array index.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efc978ad
    • drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      David Hildenbrand authored
      
      
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterprise servers cannot happen and is
        # properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      956f8b44
    • mm/sparse.c: move subsection_map related functions together · 6ecb0fc6
      Baoquan He authored
      
      
      No functional change.
      
      [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
        Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ecb0fc6
    • mm/sparse.c: add note about only VMEMMAP supporting sub-section hotplug · 95a5a34d
      Baoquan He authored
      
      
      Also note that check_pfn_span() gates the proper alignment and size of a
      hot-added memory region.
      
      And also move the code comments from inside section_deactivate() to being
      above it.  The code comments are reasonable for the whole function, and
      the moving makes code cleaner.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95a5a34d
    • mm/sparse.c: only use subsection map in VMEMMAP case · 0a9f9f62
      Baoquan He authored
      
      
      Currently, to support subsection aligned memory region adding for pmem,
      subsection map is added to track which subsection is present.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  It means the
      subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled.  For the
      classic sparse, it's meaningless.  Even worse, it may confuse people when
      checking code related to the classic sparse.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      Combining the above reasons, no need to provide subsection map and the
      relevant handling for the classic sparse.  Let's remove them.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a9f9f62
    • mm/sparse.c: introduce a new function clear_subsection_map() · 37bc1502
      Baoquan He authored
      
      
      Factor out the code which clears the subsection map of one memory region
      from section_deactivate() into clear_subsection_map().
      
      And also add helper function is_subsection_map_empty() to check if the
      current subsection map is empty or not.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37bc1502
    • mm/sparse.c: introduce new function fill_subsection_map() · 5d87255c
      Baoquan He authored
      
      
      Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.
      
      Memory sub-section hotplug was added to fix the issue that nvdimm could be
      mapped at non-section aligned starting address.  A subsection map is added
      into struct mem_section_usage to implement it.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  It means the
      subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled.  For the
      classic sparse, subsection map is meaningless and confusing.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      This patch (of 5):
      
      Factor out the code that fills the subsection map from section_activate()
      into fill_subsection_map(); this makes section_activate() cleaner and
      easier to follow.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d87255c
    • mm/memory_hotplug.c: cleanup __add_pages() · 6cdd0b30
      David Hildenbrand authored
      
      
      Let's drop the basically unused section stuff and simplify.  The logic now
      matches the logic in __remove_pages().
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cdd0b30
    • mm/memory_hotplug.c: simplify calculation of number of pages in __remove_pages() · a11b9419
      David Hildenbrand authored
      In commit 52fb87c8 ("mm/memory_hotplug: cleanup __remove_pages()"), we
      cleaned up __remove_pages(), and introduced a shorter variant to calculate
      the number of pages to the next section boundary.
      
      Turns out we can make this calculation easier to read.  We always want to
      have the number of pages (> 0) to the next section boundary, starting from
      the current pfn.
      
      We'll clean up __remove_pages() in a follow-up patch and directly make use
      of this computation.
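
      The resulting computation looks roughly like this (sketch, assuming the
      SECTION_ALIGN_UP() helper):

        /* pages from pfn up to the next section boundary, capped by what is left */
        cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);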
      
      Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11b9419
    • mm/memory_hotplug.c: only respect mem= parameter during boot stage · f3cd4c86
      Baoquan He authored
      In commit 357b4da5 ("x86: respect memory size limiting via mem=
      parameter") a global variable max_mem_size is added to store the value
      parsed from 'mem=', which is then checked when a memory region is added.
      This truly stops those DIMMs from being added into system memory during
      boot time.
      
      However, it also limits the later memory hotplug functionality.  Any DIMM
      can't be hotplugged any more if its region is beyond the max_mem_size.  We
      will get errors like:
      
      [  216.387164] acpi PNP0C80:02: add_memory failed
      [  216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
      [  216.392187] acpi PNP0C80:02: Enumeration failure
      
      This causes an issue in a known use case where 'mem=' is added on the
      hypervisor: the memory that lies beyond the 'mem=' boundary is assigned
      to KVM guests.  After commit 357b4da5 was merged, memory can't be
      extended dynamically if system memory on the hypervisor is not
      sufficient.
      
      So fix it by only applying the restriction when adding memory during the
      boot stage, and skipping it otherwise.

      Also add this use case to the documentation of the 'mem=' kernel
      parameter.
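
      The idea of the fix, as a sketch (condition form assumed, not the
      verbatim patch):

        /* Let 'mem=' (max_mem_size) restrict adding memory only while booting,
         * so that later memory hotplug is not affected. */
        if (start + size > max_mem_size && system_state < SYSTEM_RUNNING)
                return ERR_PTR(-E2BIG);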
      
      Fixes: 357b4da5 ("x86: respect memory size limiting via mem= parameter")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3cd4c86
    • mm/page_ext.c: drop pfn_present() check when onlining · dccacf8d
      David Hildenbrand authored
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we disallow to offline any
      memory with holes.  As all boot memory is online and hotplugged memory
      cannot contain holes, we never online memory with holes.
      
      This present check can be dropped.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dccacf8d
    • drivers/base/memory.c: drop pages_correctly_probed() · fada9ae3
      David Hildenbrand authored
      pages_correctly_probed() is a leftover from ancient times.  It dates back
      to commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions"), where PG_reserved checks were added as a safety net:
      
      	/*
      	 * The probe routines leave the pages reserved, just
      	 * as the bootmem code does.  Make sure they're still
      	 * that way.
      	 */
      
      The checks were refactored quite a bit over the years, especially in
      commit b77eab70 ("mm/memory_hotplug: optimize probe routine"), where
      checks for present, valid, and online sections were added.
      
      Hotplugged memory is added via add_memory(), which will create the full
      memmap for the hotplugged memory, and mark all sections valid and present.
      
      Only full memory blocks are onlined/offlined, so we also cannot have an
      inconsistency in that regard (especially, memory blocks with some sections
      being online and some being offline).
      
      1. Boot memory always starts online.  Since commit c5e79ef5
         ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
         holes") we disallow to offline any memory with holes.  Therefore, we
         never online memory with holes.  Present and validity checks are
         superfluous.
      
      2. Only complete memory blocks are onlined/offlined (and especially,
         the state - online or offline - is stored for whole memory blocks).
         Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
      manually calls offline_pages() and fiddles with memory block states.
         But it also only offlines complete memory blocks.
      
      3. To make any of these conditions trigger, something would have to be
         terribly messed up in the core.  (e.g., online/offline only some
         sections of a memory block).
      
      4. Memory unplug properly makes sure that all sysfs attributes were
         removed (and therefore, that all threads left the sysfs handlers).  We
         don't have to worry about zombie devices at this point.
      
      5. The valid_section_nr(section_nr) check is actually dead code, as it
         would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).
      
      No wonder we haven't seen any of these errors in a long time (or even
         ever, according to my search).  Let's just get rid of them.  Now, all
         checks that could hinder onlining and offlining are completely
         contained in online_pages()/offline_pages().
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fada9ae3
    • drivers/base/memory.c: drop section_count · 68c3a6ac
      David Hildenbrand authored
      Patch series "mm: drop superfluous section checks when onlining/offlining".
      
      Let's drop some superfluous section checks on the onlining/offlining path.
      
      This patch (of 3):
      
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we have a generic check in
      offline_pages() that disallows offlining memory blocks with holes.
      
      Memory blocks with missing sections are just another variant of these type
      of blocks.  We can stop checking (and especially storing) present
      sections.  A proper error message is now printed explaining why offlining failed.
      
      section_count was initially introduced in commit 07681215 ("Driver
      core: Add section count to memory_block struct") in order to detect when
      it is okay to remove a memory block.  It was used in commit 26bbe7ef
      ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
      sections") to disallow offlining memory blocks with missing sections.  As
      we refactored creation/removal of memory devices and have a proper check
      for holes in place, we can drop the section_count.
      
      This also removes a leftover comment regarding the mem_sysfs_mutex, which
      was removed in commit 848e19ad ("drivers/base/memory.c: drop the
      mem_sysfs_mutex").
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68c3a6ac
    • userfaultfd: selftests: add write-protect test · 9b12488a
      Peter Xu authored
      
      
      Add uffd tests for write protection.
      
      Instead of introducing new tests for it, let's simply squash uffd-wp
      tests into the existing uffd-missing test cases.  Changes are:
      
      (1) Bouncing tests
      
        We do the write-protection in two ways during the bouncing test:
      
        - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
          we'll make sure that for each bounce process every single page will
          be faulted at least twice: once for MISSING, once for WP.
      
        - By directly calling UFFDIO_WRITEPROTECT on existing faulted memories:
          To further torture the explicit page protection procedures of
          uffd-wp, we split each bounce procedure into two halves (in the
          background thread): the first half will be MISSING+WP for each
          page as explained above.  After the first half, we write protect
          the faulted region in the background thread to make sure at least
          half of the pages will be write protected again which is the first
          half to test the new UFFDIO_WRITEPROTECT call.  Then we continue
          with the 2nd half, which will contain both MISSING and WP faulting
          tests for the 2nd half and WP-only faults from the 1st half.
      
      (2) Event/Signal test
      
        Mostly previous tests but will do MISSING+WP for each page.  For
        sigbus-mode test we'll need to provide standalone path to handle the
        write protection faults.
      
      For all tests, do statistics as well for uffd-wp pages.
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b12488a