  1. Feb 27, 2021
• mm/rmap: use page_not_mapped in try_to_unmap() · b7e188ec
      Miaohe Lin authored
      
      
page_mapcount_is_zero() accurately calculates how many mappings a hugepage
has, only to check the result against 0.  This is a waste of CPU time.  We
can do the same check via page_not_mapped() and save some potential
atomic_read() cycles.  Remove page_mapcount_is_zero() as it is no longer
used, and move page_not_mapped() above try_to_unmap() to avoid an
'identifier undeclared' compilation error.
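
A rough sketch of the resulting helpers (hand-written here for illustration; the real code lives in mm/rmap.c):

  /* Old helper: sums every mapcount just to compare the total with zero. */
  static int page_mapcount_is_zero(struct page *page)
  {
          return !total_mapcount(page);
  }

  /* Cheaper helper: page_mapped() can usually answer without the full sum. */
  static int page_not_mapped(struct page *page)
  {
          return !page_mapped(page);
  }

  /* try_to_unmap() uses it as the rmap-walk termination check. */
  struct rmap_walk_control rwc = {
          .rmap_one  = try_to_unmap_one,
          .arg       = (void *)flags,
          .done      = page_not_mapped,
          .anon_lock = page_lock_anon_vma_read,
  };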
      
      Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/rmap: fix obsolete comment in __page_check_anon_rmap() · 90aaca85
      Miaohe Lin authored
Commit 21333b2b ("ksm: no debug in page_dup_rmap()") reverted page_dup_rmap()
to an inline atomic_inc of mapcount, so page_dup_rmap() does not call
__page_check_anon_rmap() anymore.
      
      Link: https://lkml.kernel.org/r/20210128110209.50857-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/rmap: remove unneeded semicolon in page_not_mapped() · e0af87ff
      Miaohe Lin authored
      
      
      Remove extra semicolon without any functional change intended.
      
      Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/rmap: correct some obsolete comments of anon_vma · aaf1f990
      Miaohe Lin authored
Commit 2b575eb6 ("mm: convert anon_vma->lock to a mutex") changed the
spinlock used to serialize access to the vma list into a mutex.  Commit
5a505085 ("mm/rmap: Convert the struct anon_vma::mutex to an rwsem") then
converted that mutex to an rwsem to solve a scalability problem.  Replace
"spinlock" with "rwsem" in the comments to bring them up to date.
      
      Link: https://lkml.kernel.org/r/20210123072459.25903-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/mlock: stop counting mlocked pages when none vma is found · 48b03eea
      Miaohe Lin authored
      
      
When find_vma() returns NULL, there is no vma satisfying addr < vm_end.
It is therefore pointless to traverse the vma list below, because we will
not find any vma whose mlocked pages could be counted.  Stop counting
mlocked pages in this case to save some vma list traversal cycles.
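
A hedged sketch of the shape of the change in count_mm_mlocked_page_nr() (mm/mlock.c); the exact diff may differ:

  vma = find_vma(mm, start);
  if (vma == NULL)
          return 0;       /* no vma with start < vm_end exists, nothing to count */

  for (; vma; vma = vma->vm_next) {
          /* existing logic: accumulate mlocked pages overlapping the range */
  }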
      
      Link: https://lkml.kernel.org/r/20210204110705.17586-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• virtio-mem: check against mhp_get_pluggable_range() which memory we can hotplug · 94c89453
      David Hildenbrand authored
      
      
Right now, we only check against MAX_PHYSMEM_BITS - but it turns out there
are more restrictions on which memory we can actually hotplug, especially
on arm64 or s390x once we support them: we might receive something like
-E2BIG or -ERANGE from add_memory_driver_managed(), stopping device
operation.
      
So, right when initializing the device, check which memory we can add and
warn the user.  Try adding only actually pluggable ranges: in the worst
case, no memory provided by our device is pluggable.
      
      In the usual case, we expect all device memory to be pluggable, and in
      corner cases only some memory at the end of the device-managed memory
      region to not be pluggable.
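
A rough sketch of such a check at device-init time (a simplification written here for illustration, assuming the mhp_get_pluggable_range() interface introduced in this series):

  /* during virtio-mem device initialization (sketch) */
  struct range pluggable = mhp_get_pluggable_range(true);

  if (vm->addr + vm->region_size - 1 < pluggable.start ||
      vm->addr > pluggable.end)
          dev_warn(&vm->vdev->dev,
                   "none of the device memory is addressable/pluggable\n");
  else if (vm->addr < pluggable.start ||
           vm->addr + vm->region_size - 1 > pluggable.end)
          dev_warn(&vm->vdev->dev,
                   "some device memory is not addressable/pluggable\n");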
      
      Link: https://lkml.kernel.org/r/1612149902-7867-5-git-send-email-anshuman.khandual@arm.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: teawater <teawaterz@linux.alibaba.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• s390/mm: define arch_get_mappable_range() · 7707248a
      Anshuman Khandual authored
      
      
This overrides arch_get_mappable_range() on the s390 platform, which will
be used with the recently added generic framework.  It modifies the
existing range check in vmem_add_mapping() to use arch_get_mappable_range().
It also adds a VM_BUG_ON() check to ensure that mhp_range_allowed() has
already been called on the hotplug path.
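
Roughly, the s390 override has this shape (a sketch, assuming VMEM_MAX_PHYS is the upper bound of the s390 vmem mapping):

  /* arch/s390/mm/vmem.c (sketch) */
  struct range arch_get_mappable_range(void)
  {
          struct range mappable;

          mappable.start = 0;
          mappable.end = VMEM_MAX_PHYS - 1;   /* highest physical address vmem can map */
          return mappable;
  }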
      
      Link: https://lkml.kernel.org/r/1612149902-7867-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: teawater <teawaterz@linux.alibaba.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• arm64/mm: define arch_get_mappable_range() · 03aaf83f
      Anshuman Khandual authored
      
      
This overrides arch_get_mappable_range() on the arm64 platform, which will
be used with the recently added generic framework.  It drops
inside_linear_region() and the subsequent check in arch_add_memory(), which
are no longer required.  It also adds a VM_BUG_ON() check to ensure that
mhp_range_allowed() has already been called.
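
The arm64 override is conceptually the same: the mappable range is the physical span the kernel's linear map can cover.  A hedged sketch (simplified; the real code derives the bounds from the configured VA layout):

  /* arch/arm64/mm/mmu.c (sketch) */
  struct range arch_get_mappable_range(void)
  {
          struct range mappable_range = {0, 0};

          /* Physical addresses coverable by the linear map [PAGE_OFFSET, PAGE_END). */
          mappable_range.start = __pa(_PAGE_OFFSET(vabits_actual));
          mappable_range.end   = __pa(PAGE_END - 1);
          return mappable_range;
  }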
      
      Link: https://lkml.kernel.org/r/1612149902-7867-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: teawater <teawaterz@linux.alibaba.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/memory_hotplug: prevalidate the address range being added with platform · bca3feaa
      Anshuman Khandual authored
      
      
      Patch series "mm/memory_hotplug: Pre-validate the address range with platform", v5.
      
This series adds a mechanism allowing platforms to weigh in and
prevalidate an incoming address range before proceeding further with
memory hotplug.  This helps prevent potential platform errors for the
given address range down the hotplug call chain, which would inevitably
fail the hotplug itself.
      
      This mechanism was suggested by David Hildenbrand during another
      discussion with respect to a memory hotplug fix on arm64 platform.
      
      https://lore.kernel.org/linux-arm-kernel/1600332402-30123-1-git-send-email-anshuman.khandual@arm.com/
      
This mechanism focuses on the addressability aspect and not the
[sub]section alignment aspect.  Hence check_hotplug_memory_range() and
check_pfn_span() have been left unchanged.
      
      This patch (of 4):
      
This introduces mhp_range_allowed(), which can be called in various memory
hotplug paths to prevalidate, with the platform, the address range being
added.  mhp_range_allowed() calls mhp_get_pluggable_range(), which provides
the applicable address range depending on whether a linear mapping is
required or not.  For ranges that require a linear mapping, it calls a new
arch callback, arch_get_mappable_range(), which the platform can override.
The new callback, in turn, gives the platform an opportunity to configure
acceptable memory hotplug address ranges in case there are constraints.
      
This mechanism will help prevent platform-specific errors deep down during
hotplug calls.  It drops the now-redundant
check_hotplug_memory_addressable() check in __add_pages(), and instead adds
a VM_BUG_ON() check to ensure that the range has been validated with
mhp_range_allowed() earlier in the call chain.  Besides,
mhp_get_pluggable_range() can also be used by potential memory hotplug
callers to obtain the allowed physical range that would succeed on a
given platform.
      
This does not really add any new range check in generic memory hotplug;
instead it compensates for the checks lost from arch_add_memory() (where
applicable) and check_hotplug_memory_addressable(), with the unified
mhp_range_allowed().
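
A hedged sketch of the two new helpers, following the description above (the merged bodies may differ in detail):

  /* mm/memory_hotplug.c (sketch) */
  struct range mhp_get_pluggable_range(bool need_mapping)
  {
          const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
          struct range mhp_range;

          if (need_mapping) {
                  mhp_range = arch_get_mappable_range();  /* platform constraint */
                  mhp_range.end = min_t(u64, mhp_range.end, max_phys);
          } else {
                  mhp_range.start = 0;
                  mhp_range.end = max_phys;
          }
          return mhp_range;
  }

  bool mhp_range_allowed(u64 start, u64 size, bool need_mapping)
  {
          struct range mhp_range = mhp_get_pluggable_range(need_mapping);
          u64 end = start + size;

          return start < end && start >= mhp_range.start &&
                 (end - 1) <= mhp_range.end;
  }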
      
      [akpm@linux-foundation.org: make pagemap_range() return -EINVAL when mhp_range_allowed() fails]
      
      Link: https://lkml.kernel.org/r/1612149902-7867-1-git-send-email-anshuman.khandual@arm.com
      Link: https://lkml.kernel.org/r/1612149902-7867-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com> # s390
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: teawater <teawaterz@linux.alibaba.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• Documentation: sysfs/memory: clarify some memory block device properties · a89107c0
      David Hildenbrand authored
In commit 53cdc1cb ("drivers/base/memory.c: indicate all memory blocks
as removable") we changed the output of the "removable" property of memory
devices to return "1" if and only if the kernel supports memory offlining.

Let's update the documentation, stating that the interface is legacy.  Also
update the documentation of the "state" and "valid_zones" properties.
      
      Link: https://lkml.kernel.org/r/20210201181347.13262-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• drivers/base/memory: don't store phys_device in memory blocks · e9a2e48e
      David Hildenbrand authored
No need to store the value for each and every memory block, as we can
easily query the value at runtime.  Reshuffle the members to optimize the
memory layout.  Also, let's clarify what the interface was once used for
and why it's legacy nowadays.
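
A hedged sketch of what querying at runtime (instead of caching the value in struct memory_block) can look like; arch_get_memory_phys_device() is the existing hook that s390 overrides:

  /* drivers/base/memory.c (sketch) */
  static ssize_t phys_device_show(struct device *dev,
                                  struct device_attribute *attr, char *buf)
  {
          struct memory_block *mem = to_memory_block(dev);
          unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);

          /* Ask the architecture instead of reading a cached mem->phys_device. */
          return sysfs_emit(buf, "%d\n",
                            arch_get_memory_phys_device(start_pfn));
  }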
      
      "phys_device" was used on s390x in older versions of lsmem[2]/chmem[3],
      back when they were still part of s390x-tools.  They were later replaced
      by the variants in linux-utils.  For example, RHEL6 and RHEL7 contain
      lsmem/chmem from s390-utils.  RHEL8 switched to versions from util-linux
      on s390x [4].
      
      "phys_device" was added with sysfs support for memory hotplug in commit
      3947be19 ("[PATCH] memory hotplug: sysfs and add/remove functions") in
      2005.  It always returned 0.
      
      s390x started returning something != 0 on some setups (if sclp.rzm is set
      by HW) in 2010 via commit 57b552ba ("memory hotplug/s390: set
      phys_device").
      
For s390x, it allowed identifying which memory block devices belong to
the same storage increment (RZM).  Only if all memory block devices
comprising a single storage increment were offline could the memory
actually be removed in the hypervisor.
      
Since commit e5d709bb ("s390/memory hotplug: provide
memory_block_size_bytes() function") in 2013, a memory block device spans
at least one storage increment - which is why the interface isn't really
helpful/used anymore (except by old lsmem/chmem tools).
      
      There were once RFC patches to make use of "phys_device" in ACPI context;
      however, the underlying problem could be solved using different interfaces
      [1].
      
      [1] https://patchwork.kernel.org/patch/2163871/
      [2] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/lsmem
      [3] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/chmem
      [4] https://bugzilla.redhat.com/show_bug.cgi?id=1504134
      
      Link: https://lkml.kernel.org/r/20210201181347.13262-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/memory_hotplug: use helper function zone_end_pfn() to get end_pfn · 6c922cf7
      Miaohe Lin authored
Commit 108bcc96 ("mm: add & use zone_end_pfn() and zone_spans_pfn()")
introduced the helper zone_end_pfn() to calculate the zone end pfn.  But
update_pgdat_span() forgot to use it.

Use this helper and rename the local variable zone_end_pfn to end_pfn to
avoid a naming conflict with the existing zone_end_pfn().
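
For illustration, the shape of the change in update_pgdat_span() (a sketch, not the literal diff):

  /* update_pgdat_span(), mm/memory_hotplug.c (sketch) */
  for (zone = pgdat->node_zones;
       zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
          unsigned long end_pfn = zone_end_pfn(zone);
          /* was: unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages; */

          /* ... recompute node_start_pfn / node_spanned_pages using end_pfn ... */
  }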
      
      Link: https://lkml.kernel.org/r/20210127093211.37714-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/memory_hotplug: MEMHP_MERGE_RESOURCE -> MHP_MERGE_RESOURCE · 26011267
      David Hildenbrand authored
      
      
      Let's make "MEMHP_MERGE_RESOURCE" consistent with "MHP_NONE", "mhp_t" and
      "mhp_flags".  As discussed recently [1], "mhp" is our internal acronym for
      memory hotplug now.
      
      [1] https://lore.kernel.org/linux-mm/c37de2d0-28a1-4f7d-f944-cfd7d81c334d@redhat.com/
      
      Link: https://lkml.kernel.org/r/20210126115829.10909-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/memory_hotplug: rename all existing 'memhp' into 'mhp' · 1adf8b46
      Anshuman Khandual authored
      
      
This renames all 'memhp' instances to 'mhp', except for memhp_default_state,
which is a kernel command line option.  This is just a cleanup and should
not cause a functional change.  Let's make it consistent rather than mixing
the two prefixes, in preparation for more users of the 'mhp' terminology.
      
      Link: https://lkml.kernel.org/r/1611554093-27316-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: fix memory_failure() handling of dax-namespace metadata · 34dc45be
      Dan Williams authored
Given that 'struct dev_pagemap' spans both data pages and metadata pages,
be careful to consult the altmap, if present, to delineate metadata.  In
fact, the pfn_first() helper already identifies the first valid data pfn,
so export that helper for other code paths via pgmap_pfn_valid().

Other usages of get_dev_pagemap() are not a concern because those operate
on known data pfns that have been looked up by get_user_pages().  I.e.,
metadata pfns are never user mapped.
      
      Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 6100e34b ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions · 1f90a347
      Dan Williams authored
While pfn_to_online_page() is able to determine pfn_valid() at subsection
granularity, it is not able to reliably determine whether a given pfn is
also online if the section mixes ZONE_{NORMAL,MOVABLE} with ZONE_DEVICE.
This means that pfn_to_online_page() may return invalid @page objects.
      For example with a memory map like:
      
      100000000-1fbffffff : System RAM
        142000000-143002e16 : Kernel code
        143200000-143713fff : Kernel rodata
        143800000-143b15b7f : Kernel data
        144227000-144ffffff : Kernel bss
      1fc000000-2fbffffff : Persistent Memory (legacy)
        1fc000000-2fbffffff : namespace0.0
      
      This command:
      
      echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page
      
      ...succeeds when it should fail.  When it succeeds it touches an
      uninitialized page and may crash or cause other damage (see
      dissolve_free_huge_page()).
      
      While the memory map above is contrived via the memmap=ss!nn kernel
      command line option, the collision happens in practice on shipping
      platforms.  The memory controller resources that decode spans of physical
      address space are a limited resource.  One technique platform-firmware
      uses to conserve those resources is to share a decoder across 2 devices to
      keep the address range contiguous.  Unfortunately the unit of operation of
a decoder is 64MiB while the Linux section size is 128MiB.  This results
in situations where, without subsection hotplug, memory mappings with
different lifetimes collide into one object that can only express one
lifetime.
      
      Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
      section that mixes ZONE_DEVICE pfns with other online pfns.  With
      SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall back
      to a slow-path check for ZONE_DEVICE pfns in an online section.  In the
      fast path online_section() for a full ZONE_DEVICE section returns false.
      
      Because the collision case is rare, and for simplicity, the
      SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
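
A hedged sketch of the resulting fast/slow split (simplified; the real helper also honours CONFIG_HAVE_ARCH_PFN_VALID and wraps the flag test in an accessor):

  /* pfn_to_online_page(), mm/memory_hotplug.c (sketch) */
  struct page *pfn_to_online_page(unsigned long pfn)
  {
          unsigned long nr = pfn_to_section_nr(pfn);
          struct dev_pagemap *pgmap;
          struct mem_section *ms;

          if (nr >= NR_MEM_SECTIONS)
                  return NULL;
          ms = __nr_to_section(nr);
          if (!online_section(ms) || !pfn_section_valid(ms, pfn))
                  return NULL;

          /* Fast path: no ZONE_DEVICE pfns were ever mixed into this section. */
          if (!(ms->section_mem_map & SECTION_TAINT_ZONE_DEVICE))
                  return pfn_to_page(pfn);

          /* Slow path: a ZONE_DEVICE pfn in an otherwise online section is not online. */
          pgmap = get_dev_pagemap(pfn, NULL);
          put_dev_pagemap(pgmap);
          if (pgmap)
                  return NULL;

          return pfn_to_page(pfn);
  }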
      
      [dan.j.williams@intel.com: fix CONFIG_ZONE_DEVICE=n build]
        Link: https://lkml.kernel.org/r/CAPcyv4iX+7LAgAeSqx7Zw-Zd=ZV9gBv8Bo7oTbwCOOqJoZ3+Yg@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/161058500675.1840162.7887862152161279354.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: teach pfn_to_online_page() to consider subsection validity · 9f9b02e5
      Dan Williams authored
pfn_to_online_page() is primarily used to filter out offline or fully
uninitialized pages.  Both pfn_valid() and online_section_nr() have a
coarse, per-memory-section granularity.  If a section is shared with
partially offline memory (e.g. part of ZONE_DEVICE), then
pfn_to_online_page() can produce a false positive for some pfns.  Fix this
by adding a pfn_section_valid() check, which is subsection aware.
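
A hedged sketch of the added check (the exact placement inside pfn_to_online_page() may differ):

  /* inside pfn_to_online_page() (sketch) */
  struct mem_section *ms = __pfn_to_section(pfn);

  if (!online_section(ms))
          return NULL;
  /* New: the section may be online while this particular subsection is not backed. */
  if (!pfn_section_valid(ms, pfn))
          return NULL;

  return pfn_to_page(pfn);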
      
      [mhocko@kernel.org: changelog rewrite]
      
      Link: https://lkml.kernel.org/r/161058500148.1840162.4365921007820501696.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: b13bc351 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: move pfn_to_online_page() out of line · 9f605f26
      Dan Williams authored
      
      
      Patch series "mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE", v4.
      
A pfn-walker that uses pfn_to_online_page() may inadvertently translate a
pfn as online and in the page allocator, when it is actually offline and
managed by a ZONE_DEVICE mapping (details in Patch 3: ("mm: Teach
pfn_to_online_page() about ZONE_DEVICE section collisions")).
      
The 2 proposals under consideration are to teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or to teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions.  In
order not to regress memory capacity by a few 10s to 100s of MiB, the
approach taken in this set is to add precision to pfn_to_online_page().
      
      In the course of validating pfn_to_online_page() a couple other fixes
      fell out:
      
      1/ soft_offline_page() fails to drop the reference taken in the
         madvise(..., MADV_SOFT_OFFLINE) case.
      
      2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
         however that mapping may contain data pages and metadata raw pfns.
         Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
         handling of raw metadata pfns.
      
This patch (of 4):
      
      pfn_to_online_page() is already too large to be a macro or an inline
      function.  In anticipation of further logic changes / growth, move it out
      of line.
      
      No functional change, just code movement.
      
      Link: https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/161058499608.1840162.10165648147615238793.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/vmstat.c: erase latency in vmstat_shepherd · fbcc8183
      Jiang Biao authored
      
      
Many 100us+ latencies have been detected in vmstat_shepherd() on a CPX
platform which has 208 logical CPUs.  And vmstat_shepherd is queued every
second, which could make the case worse.

Add a scheduling point in vmstat_shepherd() to erase the latency.
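
The scheduling point sits in the per-CPU loop; roughly (a sketch):

  /* vmstat_shepherd(), mm/vmstat.c (sketch) */
  for_each_online_cpu(cpu) {
          struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

          cond_resched();   /* new: avoid long latency tails on machines with many CPUs */

          if (!delayed_work_pending(dw) && need_update(cpu))
                  queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
  }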
      
      Link: https://lkml.kernel.org/r/20210111035526.1511-1-benbjiang@tencent.com
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Reported-by: Bin Lai <robinlai@tencent.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmstat: add some comments on internal storage of byte items · 629484ae
      Johannes Weiner authored
      
      
      Byte-accounted items are used for slab object accounting at the cgroup
      level, because the objects in a slab page can belong to different cgroups.
      At the global level these items always change in multiples of whole slab
      pages.  The vmstat code exploits this and stores these items as pages
      internally, which allows for more compact per-cpu data.
      
      This optimization isn't self-evident from the asserts and the division in
      the stat update functions.  Provide the reader with some context.
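
For context, a hedged sketch of what the node-level stat updater does with byte-counted items (assuming the existing NR_SLAB_*_B items; the exact assertion may differ):

  /* __mod_node_page_state(), mm/vmstat.c (sketch) */
  if (vmstat_item_in_bytes(item)) {
          /*
           * Only slab byte items reach the node level, and there they always
           * change by whole pages, so they can be stored as pages internally.
           */
          VM_WARN_ON_ONCE(!(item == NR_SLAB_RECLAIMABLE_B ||
                            item == NR_SLAB_UNRECLAIMABLE_B));
          delta >>= PAGE_SHIFT;
  }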
      
      Link: https://lkml.kernel.org/r/20210202184411.118614-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmstat: fix NOHZ wakeups for node stat changes · 2bbd00ae
      Johannes Weiner authored
      On NOHZ, the periodic vmstat flushers on each CPU can go to sleep and
      won't wake up until stat changes are detected in the per-cpu deltas of the
      zone vmstat counters.
      
      In commit 75ef7184 ("mm, vmstat: add infrastructure for per-node
      vmstats") per-node counters were introduced, and subsequently most stats
      were moved from the zone to the node level.  However, the node counters
      weren't added to the NOHZ wakeup detection.
      
In theory this can cause per-cpu errors to remain in the user-reported
stats indefinitely.  In practice this only affects a handful of
sub-counters (e.g. file_mapped, dirty and writeback) because other page
state changes at the node level likely involve a change at the zone level
as well (alloc and free, lru ops).  Also, nobody has complained.
      
Fix it up for completeness: wake up vmstat refreshing on node changes.
Also remove the BUILD_BUG_ONs that assert counter size; we haven't relied
on them since we added sizeof() to the range calculation in commit
13c9aaf7 ("mm/vmstat.c: fix NUMA statistics updates").
      
      Link: https://lkml.kernel.org/r/20210202184342.118513-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: cma: print region name on failure · a052d4d1
      Patrick Daly authored
      
      
      Print the name of the CMA region for convenience.  This is useful
      information to have when cma_alloc() fails.
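
A hedged sketch of the failure message in cma_alloc() (the format string here is illustrative, not necessarily the one merged):

  /* cma_alloc(), mm/cma.c (sketch) */
  if (ret && !no_warn) {
          pr_err("%s: %s: alloc failed, req-size: %zu pages, ret: %d\n",
                 __func__, cma->name, count, ret);
          cma_debug_show_areas(cma);
  }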
      
      [pdaly@codeaurora.org: print the "count" variable]
        Link: https://lkml.kernel.org/r/20210209142414.12768-1-georgi.djakov@linaro.org
      
      Link: https://lkml.kernel.org/r/20210208115200.20286-1-georgi.djakov@linaro.org
Signed-off-by: Patrick Daly <pdaly@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo · 3c381db1
      David Hildenbrand authored
      
      
      Let's count the number of CMA pages per zone and print them in
      /proc/zoneinfo.
      
      Having access to the total number of CMA pages per zone is helpful for
      debugging purposes to know where exactly the CMA pages ended up, and to
      figure out how many pages of a zone might behave differently, even after
      some of these pages might already have been allocated.
      
      As one example, CMA pages part of a kernel zone cannot be used for
      ordinary kernel allocations but instead behave more like ZONE_MOVABLE.
      
      For now, we are only able to get the global nr+free cma pages from
      /proc/meminfo and the free cma pages per zone from /proc/zoneinfo.
      
      Example after this patch when booting a 6 GiB QEMU VM with
      "hugetlb_cma=2G":
        # cat /proc/zoneinfo | grep cma
                cma      0
              nr_free_cma  0
                cma      0
              nr_free_cma  0
                cma      524288
              nr_free_cma  493016
                cma      0
                cma      0
        # cat /proc/meminfo | grep Cma
        CmaTotal:        2097152 kB
        CmaFree:         1972064 kB
      
Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
      one can be sure when spotting "cma 0" that there are definitely no
      CMA pages located in a zone.
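
A hedged sketch of where the new per-zone counter is bumped and printed (upstream wraps the field in a helper so it reads 0 without CONFIG_CMA):

  /* init_cma_reserved_pageblock(), mm/page_alloc.c (sketch) */
  page_zone(page)->cma_pages += pageblock_nr_pages;

  /* zoneinfo_show_print(), mm/vmstat.c (sketch) */
  seq_printf(m,
             "\n        present  %lu"
             "\n        managed  %lu"
             "\n        cma      %lu",
             zone->present_pages,
             zone_managed_pages(zone),
             zone->cma_pages);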
      
      [david@redhat.com: v2]
        Link: https://lkml.kernel.org/r/20210128164533.18566-1-david@redhat.com
      [david@redhat.com: v3]
        Link: https://lkml.kernel.org/r/20210129113451.22085-1-david@redhat.com
      
      Link: https://lkml.kernel.org/r/20210127101813.6370-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/cma: expose all pages to the buddy if activation of an area fails · 072355c1
      David Hildenbrand authored
      
      
      Right now, if activation fails, we might already have exposed some pages
      to the buddy for CMA use (although they will never get actually used by
      CMA), and some pages won't be exposed to the buddy at all.
      
      Let's check for "single zone" early and on error, don't expose any pages
      for CMA use - instead, expose them to the buddy available for any use.
      Simply call free_reserved_page() on every single page - easier than going
      via free_reserved_area(), converting back and forth between pfns and virt
      addresses.
      
      In addition, make sure to fixup totalcma_pages properly.
      
      Example: 6 GiB QEMU VM with "... hugetlb_cma=2G movablecore=20% ...":
        [    0.006891] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
        [    0.006893] cma: Reserved 2048 MiB at 0x0000000100000000
        [    0.006893] hugetlb_cma: reserved 2048 MiB on node 0
        ...
        [    0.175433] cma: CMA area hugetlb0 could not be activated
      
      Before this patch:
        # cat /proc/meminfo
        MemTotal:        5867348 kB
        MemFree:         5692808 kB
        MemAvailable:    5542516 kB
        ...
        CmaTotal:        2097152 kB
        CmaFree:         1884160 kB
      
      After this patch:
        # cat /proc/meminfo
        MemTotal:        6077308 kB
        MemFree:         5904208 kB
        MemAvailable:    5747968 kB
        ...
        CmaTotal:              0 kB
        CmaFree:               0 kB
      
      Note: cma_init_reserved_mem() makes sure that we always cover full
      pageblocks / MAX_ORDER - 1 pages.
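
A hedged sketch of the error path in cma_activate_area() after this change (simplified):

  /* cma_activate_area(), mm/cma.c - error path (sketch) */
  out_error:
          /* Expose all pages to the buddy; they are useless for CMA. */
          for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++)
                  free_reserved_page(pfn_to_page(pfn));
          totalcma_pages -= cma->count;
          cma->count = 0;
          pr_err("CMA area %s could not be activated\n", cma->name);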
      
      Link: https://lkml.kernel.org/r/20210127101813.6370-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: cma: allocate cma areas bottom-up · df2ff39e
      Roman Gushchin authored
      
      
      Currently cma areas without a fixed base are allocated close to the end of
      the node.  This placement is sub-optimal because of compaction: it brings
pages into the cma area.  In particular, it can bring in hot executable
pages, even if there is plenty of free memory on the machine.  This
results in cma allocation failures.
      
      Instead let's place cma areas close to the beginning of a node.  In this
      case the compaction will help to free cma areas, resulting in better cma
      allocation success rates.
      
If there is enough memory, let's try to allocate bottom-up, starting at
4GB, to exclude any possible interference with DMA32.  On smaller machines,
or in case of failure, stick with the old behavior.
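
A hedged sketch of the bottom-up attempt in cma_declare_contiguous_nid() (simplified; the real code keeps the old top-down allocation as the fallback):

  /* cma_declare_contiguous_nid(), mm/cma.c (sketch) */
  if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
          /* Place the area low in the node, but above 4GB to spare DMA/DMA32. */
          memblock_set_bottom_up(true);
          addr = memblock_alloc_range_nid(size, alignment, SZ_4G,
                                          limit, nid, true);
          memblock_set_bottom_up(false);
  }
  if (!addr)      /* smaller machine or bottom-up failure: old behavior */
          addr = memblock_alloc_range_nid(size, alignment, base,
                                          limit, nid, true);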
      
      16GB vm, 2GB cma area:
      With this patch:
      [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
      [    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
      [    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
      [    0.002931] hugetlb_cma: reserved 2048 MiB on node 0
      
      Without this patch:
      [    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
      [    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
      [    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
      [    0.002934] hugetlb_cma: reserved 2048 MiB on node 0
      
      v2:
        - switched to memblock_set_bottom_up(true), by Mike
        - start with 4GB, by Mike
      
      [guro@fb.com: whitespace fix, per Mike]
        Link: https://lkml.kernel.org/r/20201221170551.GB3428478@carbon.DHCP.thefacebook.com
      [guro@fb.com: fix 32-bit warnings]
        Link: https://lkml.kernel.org/r/20201223163537.GA4011967@carbon.DHCP.thefacebook.com
      [guro@fb.com: fix 32-bit systems]
      [akpm@linux-foundation.org: build fix]
      
      Link: https://lkml.kernel.org/r/20201217201214.3414100-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm,shmem,thp: limit shmem THP allocations to requested zones · 187df5dd
      Rik van Riel authored
      
      
      Hugh pointed out that the gma500 driver uses shmem pages, but needs to
      limit them to the DMA32 zone.  Ensure the allocations resulting from the
      gfp_mask returned by limit_gfp_mask use the zone flags that were
      originally passed to shmem_getpage_gfp.
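
A hedged sketch of the fix inside limit_gfp_mask() (mm/shmem.c): keep the zone bits from the gfp mask originally passed to shmem_getpage_gfp(), not the ones from the huge-page gfp:

  /* limit_gfp_mask(), mm/shmem.c (sketch of the zone handling) */
  gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
  gfp_t result    = huge_gfp & ~GFP_ZONEMASK;

  /* Allow allocations only from the zones the caller originally specified. */
  result |= zoneflags;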
      
      Link: https://lkml.kernel.org/r/20210224121016.1314ed6d@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm,thp,shmem: make khugepaged obey tmpfs mount flags · cd89fb06
      Rik van Riel authored
      Currently if thp enabled=[madvise], mounting a tmpfs filesystem with
      huge=always and mmapping files from that tmpfs does not result in
      khugepaged collapsing those mappings, despite the mount flag indicating
      that it should.
      
      Fix that by breaking up the blocks of tests in hugepage_vma_check a little
      bit, and testing things in the correct order.
      
      Link: https://lkml.kernel.org/r/20201124194925.623931-4-riel@surriel.com
Fixes: c2231020 ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Rik van Riel <riel@surriel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm,thp,shm: limit gfp mask to no more than specified · 78cc8cdc
      Rik van Riel authored
      
      
      Matthew Wilcox pointed out that the i915 driver opportunistically
      allocates tmpfs memory, but will happily reclaim some of its pool if no
      memory is available.
      
      Make sure the gfp mask used to opportunistically allocate a THP is always
      at least as restrictive as the original gfp mask.
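
A hedged sketch of the helper this adds to mm/shmem.c (the exact flag sets are an assumption based on the description; the point is that the THP gfp can only become more restrictive, never less):

  /* mm/shmem.c (sketch) */
  static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
  {
          gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
          gfp_t denyflags  = __GFP_NOWARN | __GFP_NORETRY;
          gfp_t result     = huge_gfp & ~allowflags;

          /* Union of the "deny" flags, intersection of the "allow" flags. */
          result |= (limit_gfp & denyflags);
          result |= (huge_gfp & limit_gfp) & allowflags;

          return result;
  }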
      
      Link: https://lkml.kernel.org/r/20201124194925.623931-3-riel@surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm,thp,shmem: limit shmem THP alloc gfp_mask · 164cc4fe
      Rik van Riel authored
      
      
      Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.
      
The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
keep the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
      
      This patch (of 4):
      
The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
keep the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.
      
      Controlling the gfp_mask of THP allocations through the knobs in sysfs
      allows users to determine the balance between how aggressively the system
      tries to allocate THPs at fault time, and how much the application may end
      up stalling attempting those allocations.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
      
      Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove pagevec_lookup_entries · a656a202
      Matthew Wilcox (Oracle) authored
      
      
      pagevec_lookup_entries() is now just a wrapper around find_get_entries()
      so remove it and convert all its callers.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-15-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: pass pvec directly to find_get_entries · cf2039af
      Matthew Wilcox (Oracle) authored
      
      
      All callers of find_get_entries() use a pvec, so pass it directly instead
      of manipulating it in the caller.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: remove nr_entries parameter from pagevec_lookup_entries · 38cefeb3
      Matthew Wilcox (Oracle) authored
      
      
      All callers want to fetch the full size of the pvec.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: add an 'end' parameter to pagevec_lookup_entries · 31d270fd
      Matthew Wilcox (Oracle) authored
      
      
      Simplifies the callers and uses the existing functionality in
      find_get_entries().  We can also drop the final argument of
      truncate_exceptional_pvec_entries() and simplify the logic in that
      function.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: add an 'end' parameter to find_get_entries · ca122fe4
      Matthew Wilcox (Oracle) authored
      
      
      This simplifies the callers and leads to a more efficient implementation
      since the XArray has this functionality already.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: add and use find_lock_entries · 5c211ba2
      Matthew Wilcox (Oracle) authored
      
      
      We have three functions (shmem_undo_range(), truncate_inode_pages_range()
      and invalidate_mapping_pages()) which want exactly this function, so add
      it to filemap.c.  Before this patch, shmem_undo_range() would split any
      compound page which overlaps either end of the range being punched in both
      the first and second loops through the address space.  After this patch,
      that functionality is left for the second loop, which is arguably more
      appropriate since the first loop is supposed to run through all the pages
      quickly, and splitting a page can sleep.
      
      [willy@infradead.org: add assertion]
        Link: https://lkml.kernel.org/r/20201124041507.28996-3-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• iomap: use mapping_seek_hole_data · 54fa39ac
      Matthew Wilcox (Oracle) authored
      
      
      Enhance mapping_seek_hole_data() to handle partially uptodate pages and
      convert the iomap seek code to call it.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/filemap: add mapping_seek_hole_data · 41139aa4
      Matthew Wilcox (Oracle) authored
      
      
      Rewrite shmem_seek_hole_data() and move it to filemap.c.
      
      [willy@infradead.org: don't put an xa_is_value() page]
        Link: https://lkml.kernel.org/r/20201124041507.28996-4-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/filemap: add helper for finding pages · c7bad633
      Matthew Wilcox (Oracle) authored
      
      
      There is a lot of common code in find_get_entries(),
      find_get_pages_range() and find_get_pages_range_tag().  Factor out
      find_get_entry() which simplifies all three functions.
      
      [willy@infradead.org: remove VM_BUG_ON_PAGE()]
  Link: https://lkml.kernel.org/r/20201124041507.28996-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20201112212641.27837-7-willy@infradead.org
      
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/filemap: rename find_get_entry to mapping_get_entry · bc5a3011
      Matthew Wilcox (Oracle) authored
      
      
      find_get_entry doesn't "find" anything.  It returns the entry at a
      particular index.
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: add FGP_ENTRY · 44835d20
      Matthew Wilcox (Oracle) authored
      
      
      The functionality of find_lock_entry() and find_get_entry() can be
      provided by pagecache_get_page(), which lets us delete find_lock_entry()
      and make find_get_entry() static.
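
A hedged usage sketch: with the new FGP_ENTRY flag, pagecache_get_page() can return shadow/swap entries (xa_is_value()) instead of filtering them out, which is what find_lock_entry() used to provide:

  /* caller sketch, e.g. in the shmem/truncate paths */
  struct page *page;

  page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_LOCK, 0);
  if (xa_is_value(page)) {
          /* shadow or swap entry, not a real page */
  } else if (page) {
          /* locked page: use it, then unlock_page(page) and put_page(page) */
  }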
      
      Link: https://lkml.kernel.org/r/20201112212641.27837-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>