  1. Sep 25, 2019
• mm/vmalloc: do not keep unpurged areas in the busy tree · dd3b8353
      Uladzislau Rezki (Sony) authored
      
      
The busy tree can be quite big: even though an area is freed or unmapped,
it still stays there until the "purge" logic removes it.
      
1) Optimize and reduce the size of the "busy" tree by removing a node
   from it right away as soon as a user triggers the free path.  It is
   possible to do so because the allocation is done using another
   augmented tree.
      
The vmalloc test driver shows the difference; for example, the
"fix_size_alloc_test" is ~11% better compared with the default configuration:
      
      sudo ./test_vmalloc.sh performance
      
      <default>
      Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
      Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
      Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
      <default>
      
      <this patch>
      Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
      Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
      Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
      <this patch>
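
The ~11% figure follows directly from the fix_size_alloc_test averages above:

    (993985 - 882263) / 993985 ≈ 0.112, i.e. roughly an 11% lower average time per loop.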
      
2) Since the busy tree now contains allocated areas only and does not
   interfere with lazily freed nodes, introduce a new function,
   show_purge_info(), that dumps "unpurged" areas; the information is
   exposed through "/proc/vmallocinfo".
      
      3) Eliminate VM_LAZY_FREE flag.
      
      Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd3b8353
• mm/sparse.c: remove NULL check in clear_hwpoisoned_pages() · 5ed86703
      Alastair D'Silva authored
      There is no possibility for memmap to be NULL in the current codebase.
      
      This check was added in commit 95a4774d ("memory-hotplug: update
      mce_bad_pages when removing the memory") where memmap was originally
      inited to NULL, and only conditionally given a value.
      
The code that could have passed a NULL has been removed by commit
ba72b4c8 ("mm/sparsemem: support sub-section hotplug"), so there is no
longer a possibility that memmap can be NULL.
      
      Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ed86703
• mm/sparse.c: don't manually decrement num_poisoned_pages · 9f82883c
      Alastair D'Silva authored
      
      
      Use the function written to do it instead.
      
      Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f82883c
• mm/sparse.c: use __nr_to_section(section_nr) to get mem_section · c1cbc3ee
      Wei Yang authored
      
      
      __pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)).
      
Since we already have section_nr, it is not necessary to derive the
mem_section from start_pfn.  Using __nr_to_section(section_nr) directly
removes one redundant operation.
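
As a side note, a tiny userspace sketch of the relationship described above; the
macro bodies, the SECTION_SHIFT value and the sections[] array are simplified
stand-ins for illustration, not the kernel's definitions:

#include <stdio.h>

/* Simplified stand-ins for the helpers named above. */
#define SECTION_SHIFT 15                      /* pretend PFNs per section = 1 << 15 */
#define pfn_to_section_nr(pfn) ((pfn) >> SECTION_SHIFT)
static int sections[1024];                    /* pretend mem_section table */
#define __nr_to_section(nr)    (&sections[(nr)])
/* __pfn_to_section(pfn) is defined as __nr_to_section(pfn_to_section_nr(pfn)) */
#define __pfn_to_section(pfn)  __nr_to_section(pfn_to_section_nr(pfn))

int main(void)
{
        unsigned long start_pfn = 0x48000;
        unsigned long section_nr = pfn_to_section_nr(start_pfn);

        /* With section_nr already in hand, the second form skips the redundant
         * pfn_to_section_nr() step but yields the same section. */
        printf("same section: %d\n",
               __pfn_to_section(start_pfn) == __nr_to_section(section_nr));
        return 0;
}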
      
      Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1cbc3ee
• mm/sparse.c: fix ALIGN() without power of 2 in sparse_buffer_alloc() · db57e98d
      Lecopzer Chen authored
      
      
      The size argument passed into sparse_buffer_alloc() has already been
      aligned with PAGE_SIZE or PMD_SIZE.
      
If the aligned size is not a power of 2 (e.g.  0x480000), PTR_ALIGN()
will return a wrong value.  Use roundup() to round sparsemap_buf up to
the next multiple of the size.
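
A small userspace sketch of the failure mode; ALIGN_MASK/ROUNDUP below mirror the
usual mask-based alignment idiom and a generic roundup, as illustrative stand-ins
rather than copies of the kernel macros:

#include <stdio.h>

/* Power-of-2-only alignment, as mask-based ALIGN()/PTR_ALIGN() assumes. */
#define ALIGN_MASK(x, a)  (((x) + (a) - 1) & ~((a) - 1))
/* Generic rounding that also works for non-power-of-2 sizes. */
#define ROUNDUP(x, a)     ((((x) + (a) - 1) / (a)) * (a))

int main(void)
{
        unsigned long buf  = 0xd3001000UL;   /* pretend sparsemap_buf address */
        unsigned long size = 0x480000UL;     /* 4.5M: PAGE_SIZE-aligned but not a power of 2 */

        printf("mask-based align: %#lx (not a multiple of %#lx)\n",
               ALIGN_MASK(buf, size), size);
        printf("roundup:          %#lx (%#lx * %lu)\n",
               ROUNDUP(buf, size), size, ROUNDUP(buf, size) / size);
        return 0;
}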
      
      Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com>
      Cc: YJ Chiang <yj.chiang@mediatek.com>
      Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db57e98d
• mm/sparse.c: fix memory leak of sparsemap_buf in aligned memory · ae831894
      Lecopzer Chen authored
      
      
sparse_buffer_alloc(size) takes memory of the given size from sparsemap_buf
after sparsemap_buf has been aligned to that size.  However, the size is at
least PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
than PAGE_SIZE.

Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
sparsemap_buf_end.  Since sparsemap_buf may have been advanced by PTR_ALIGN()
first, the space skipped over before the aligned sparsemap_buf is wasted and
no one will touch it.
      
In our ARM32 platform (without SPARSEMEM_VMEMMAP)
  sparse_buffer_init
    Reserve d359c000 - d3e9c000 (9M)
  sparse_buffer_alloc
    Alloc   d3a00000 - d3e80000 (4.5M)
  sparse_buffer_fini
    Free    d3e80000 - d3e9c000 (~=100k)
 The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.
      
      In ARM64 platform (with SPARSEMEM_VMEMMAP)
      
        sparse_buffer_init
          Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
  sparse_buffer_alloc
    Alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
  sparse_buffer_fini
          Free    ffffffc07f600000 - ffffffc07f623000 (140K)
       The reserved memory between ffffffc07d623000 - ffffffc07d800000
       (~=1.9M) is unfreed.
      
Let's explicitly free the redundant aligned memory.
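
Plugging the ARM32 numbers above into a quick userspace check shows where the
reservation goes (purely illustrative arithmetic, no kernel code involved):

#include <stdio.h>

int main(void)
{
        unsigned long reserve_lo = 0xd359c000UL, reserve_hi = 0xd3e9c000UL;
        unsigned long alloc_lo   = 0xd3a00000UL, alloc_hi   = 0xd3e80000UL;

        printf("reserved:            %lu KiB\n", (reserve_hi - reserve_lo) >> 10);
        printf("used by memmap:      %lu KiB\n", (alloc_hi - alloc_lo) >> 10);
        printf("freed by _fini:      %lu KiB\n", (reserve_hi - alloc_hi) >> 10);
        printf("leaked before align: %lu KiB\n", (alloc_lo - reserve_lo) >> 10);
        return 0;
}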
      
      [arnd@arndb.de: mark sparse_buffer_free as __meminit]
        Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: YJ Chiang <yj.chiang@mediatek.com>
      Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae831894
• mm/memory_hotplug.c: s/is/if · 29a90db9
      Souptick Joarder authored
      
      
      Correct typo in comment.
      
      Link: http://lkml.kernel.org/r/1568233954-3913-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29a90db9
• mm/memory_hotplug: online_pages cannot be 0 in online_pages() · ca9a46f8
      David Hildenbrand authored
      
      
      walk_system_ram_range() will fail with -EINVAL in case
      online_pages_range() was never called (== no resource applicable in the
      range).  Otherwise, we will always call online_pages_range() with nr_pages
      > 0 and, therefore, have online_pages > 0.
      
      Remove that special handling.
      
      Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca9a46f8
• mm/memory_hotplug: make sure the pfn is aligned to the order when onlining · bd02cc01
      David Hildenbrand authored
Commit a9cd410a ("mm/page_alloc.c: memory hotplug: free pages as higher
order") assumed that any PFN we get via memory resources is aligned to
MAX_ORDER - 1; I am not convinced that is always true.  Let's play it
safe: check the alignment and fall back to single pages.
      
      akpm: warn in this situation so we get to find out if and why this ever
      occurs.
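
A userspace sketch of the alignment test involved; the IS_ALIGNED-style check is
re-stated locally and the pfn/order values are made up, so this only illustrates
the fallback idea, not the patch itself:

#include <stdio.h>

/* Same idea as the kernel's IS_ALIGNED(): x is a multiple of the power-of-2 a. */
#define IS_ALIGNED_TO(x, a)  (((x) & ((a) - 1)) == 0)

int main(void)
{
        unsigned long pfn = 0x40010;      /* hypothetical start PFN of a resource */
        unsigned int order = 10;          /* e.g. a MAX_ORDER - 1 sized chunk */

        if (!IS_ALIGNED_TO(pfn, 1UL << order)) {
                /* Misaligned start: fall back to onlining single pages (order 0). */
                printf("pfn %#lx not aligned to 1 << %u, using order 0\n", pfn, order);
                order = 0;
        }
        printf("onlining in chunks of %lu pages\n", 1UL << order);
        return 0;
}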
      
      [akpm@linux-foundation.org: add WARN_ON_ONCE()]
      Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd02cc01
• mm/memory_hotplug: simplify online_pages_range() · b2c2ab20
      David Hildenbrand authored
      
      
      online_pages always corresponds to nr_pages.  Simplify the code, getting
      rid of online_pages_blocks().  Add some comments.
      
      Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2c2ab20
• mm/memory_hotplug: drop PageReserved() check in online_pages_range() · 5ecae635
      David Hildenbrand authored
      
      
      move_pfn_range_to_zone() will set all pages to PG_reserved via
      memmap_init_zone().  The only way a page could no longer be reserved would
      be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
      done (the online_page callback is used for that purpose by e.g., Hyper-V
      instead).  walk_system_ram_range() will never call online_pages_range()
      with duplicate PFNs, so drop the PageReserved() check.
      
      This seems to be a leftover from ancient times where the memmap was
      initialized when adding memory and we wanted to check for already onlined
      memory.
      
      Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ecae635
• mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range() · 00ff9a91
      David Hildenbrand authored
      
      
      Patch series "mm/memory_hotplug: online_pages() cleanups", v2.
      
      Some cleanups (+ one fix for a special case) in the context of
      online_pages().
      
      This patch (of 5):
      
      This makes it clearer that we will never call func() with duplicate PFNs
      in case we have multiple sub-page memory resources.  All unaligned parts
      of PFNs are completely discarded.
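
A userspace illustration of how PFN_UP/PFN_DOWN round a byte range to whole pages
so unaligned head/tail parts are discarded; PAGE_SHIFT = 12 and the resource range
are assumptions for the demo:

#include <stdio.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

int main(void)
{
        /* A hypothetical RAM resource that starts and ends mid-page. */
        unsigned long start = 0x40000800UL;
        unsigned long end   = 0x402017ffUL;          /* inclusive, as in struct resource */

        unsigned long pfn     = PFN_UP(start);       /* first fully covered page */
        unsigned long end_pfn = PFN_DOWN(end + 1);   /* one past the last full page */

        printf("pfns [%#lx, %#lx): %lu full pages, partial edges dropped\n",
               pfn, end_pfn, end_pfn - pfn);
        return 0;
}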
      
      Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00ff9a91
• mm/memory_hotplug.c: prevent memory leak when reusing pgdat · 33fce011
      Wei Yang authored
      
      
When offlining a node in try_offline_node(), the pgdat is not released,
so it can be reused in hotadd_new_pgdat().  However, we reallocate
pgdat->per_cpu_nodestats if this pgdat is reused, leaking the previous
allocation.

This patch prevents the memory leak by only allocating per_cpu_nodestats
when the pgdat is new.
      
      Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33fce011
• drivers/base/memory.c: don't store end_section_nr in memory blocks · b6c88d3b
      David Hildenbrand authored
      
      
      Each memory block spans the same amount of sections/pages/bytes.  The size
      is determined before the first memory block is created.  No need to store
      what we can easily calculate - and the calculations even look simpler now.
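
A minimal sketch of the derivation this relies on; the variable names and values
are assumptions for illustration only:

#include <stdio.h>

int main(void)
{
        unsigned long sections_per_block = 8;     /* fixed once, at init time */
        unsigned long start_section_nr   = 128;   /* stored per memory block */

        /* No need to store an end section number: it is fully determined. */
        unsigned long end_section_nr = start_section_nr + sections_per_block - 1;

        printf("block covers sections %lu..%lu\n", start_section_nr, end_section_nr);
        return 0;
}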
      
      Michal brought up the idea of variable-sized memory blocks.  However, if
      we ever implement something like this, we will need an API compatibility
      switch and reworks at various places (most code assumes a fixed memory
      block size).  So let's cleanup what we have right now.
      
      While at it, fix the variable naming in register_mem_sect_under_node() -
      we no longer talk about a single section.
      
      Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6c88d3b
• driver/base/memory.c: validate memory block size early · 902ce63b
      David Hildenbrand authored
      
      
      Let's validate the memory block size early, when initializing the memory
      device infrastructure.  Fail hard in case the value is not suitable.
      
      As nobody checks the return value of memory_dev_init(), turn it into a
      void function and fail with a panic in all scenarios instead.  Otherwise,
      we'll crash later during boot when core/drivers expect that the memory
      device infrastructure (including memory_block_size_bytes()) works as
      expected.
      
      I think long term, we should move the whole memory block size
      configuration (set_memory_block_size_order() and
      memory_block_size_bytes()) into drivers/base/memory.c.
      
      Link: http://lkml.kernel.org/r/20190806090142.22709-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      902ce63b
• drivers/base/memory.c: fixup documentation of removable/phys_index/block_size_bytes · f915fb7f
      David Hildenbrand authored
      
      
      Let's rephrase to memory block terminology and add some further
      clarifications.
      
      Link: http://lkml.kernel.org/r/20190806080826.5963-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f915fb7f
• drivers/base/node.c: simplify unregister_memory_block_under_nodes() · d84f2f5a
      David Hildenbrand authored
      
      
We don't allow offlining memory block devices that belong to multiple
NUMA nodes.  Therefore, such devices can never get removed.  It is
sufficient to process a single node when removing the memory block.  No
need to iterate over each and every PFN.
      
      We already have the nid stored for each memory block.  Make sure that the
      nid always has a sane value.
      
Please note that checking for node_online(nid) is not required.  If we
had a memory block belonging to a node that is no longer online, then we
would have a BUG in the node offlining code.
      
      Link: http://lkml.kernel.org/r/20190719135244.15242-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d84f2f5a
• mm/memory_hotplug: remove move_pfn_range() · 3fccb74c
      David Hildenbrand authored
      
      
      Let's remove this indirection.  We need the zone in the caller either way,
      so let's just detect it there.  Add some documentation for
      move_pfn_range_to_zone() instead.
      
      [akpm@linux-foundation.org: restore newline, per David]
      Link: http://lkml.kernel.org/r/20190724142324.3686-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fccb74c
• mm: do not hash address in print_bad_pte() · 6aa9b8b2
      Kefeng Wang authored
      
      
Use %px to show the actual address in print_bad_pte() to help debug
issues.
      
      Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6aa9b8b2
• mm: consolidate pgtable_cache_init() and pgd_cache_init() · 782de70c
      Mike Rapoport authored
      
      
      Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
      cache for page table allocations on several architectures that do not use
      PAGE_SIZE tables for one or more levels of the page table hierarchy.
      
      Most architectures do not implement these functions and use __weak default
      NOP implementation of pgd_cache_init().  Since there is no such default
      for pgtable_cache_init(), its empty stub is duplicated among most
      architectures.
      
      Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
      drop empty stubs of pgtable_cache_init().
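
A userspace illustration of the weak-default pattern the consolidation relies on
(GCC/Clang weak symbols); the body and the override comment are illustrative, not
the kernel's actual code:

#include <stdio.h>

/* Generic default: a no-op, marked weak so an "architecture" can override it. */
__attribute__((weak)) void pgtable_cache_init(void)
{
        /* Most architectures need no page-table kmem caches: do nothing. */
}

int main(void)
{
        /* With no strong definition provided, the weak no-op above is used.
         * An arch that needs non-PAGE_SIZE table levels would supply its own
         * strong pgtable_cache_init() in another translation unit. */
        pgtable_cache_init();
        printf("pgtable_cache_init() called (weak default)\n");
        return 0;
}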
      
      Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Will Deacon <will@kernel.org>		[arm64]
      Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      782de70c
• microblaze: switch to generic version of pte allocation · 1b9a9d85
      Mike Rapoport authored
      
      
The microblaze implementation of pte_alloc_one() has a provision to
allocate PTEs from high memory, but neither CONFIG_HIGHPTE nor the
pte_map*() variants suitable for HIGHPTE are defined.

Apart from that, the microblaze version of pte_alloc_one() is identical to
the generic one, as are the implementations of pte_free() and
pte_free_kernel().

Switch microblaze to use the generic versions of these functions.  Also
remove pte_free_slow(), which is not referenced anywhere in the code.
      
      Link: http://lkml.kernel.org/r/1565690952-32158-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b9a9d85
• sh: switch to generic version of pte allocation · 6fb12766
      Mike Rapoport authored
      
      
The sh implementations of pte_alloc_one(), pte_alloc_one_kernel(),
pte_free_kernel() and pte_free() are identical to the generic ones except
for the lack of __GFP_ACCOUNT for user PTE allocations.

Switch sh to use the generic versions of these functions.
      
      Link: http://lkml.kernel.org/r/1565250728-21721-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fb12766
• ia64: switch to generic version of pte allocation · 01319921
      Mike Rapoport authored
      
      
The ia64 implementations of pte_alloc_one(), pte_alloc_one_kernel(),
pte_free_kernel() and pte_free() are identical to the generic ones except
for the lack of __GFP_ACCOUNT for user PTE allocations.

Switch ia64 to use the generic versions of these functions.
      
      Link: http://lkml.kernel.org/r/1565250728-21721-3-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01319921
• mm: remove quicklist page table caches · 13224794
      Nicholas Piggin authored
      
      
      Patch series "mm: remove quicklist page table caches".
      
      A while ago Nicholas proposed to remove quicklist page table caches [1].
      
I've rebased his patch on the current upstream and switched ia64 and sh
to use the generic versions of PTE allocation.
      
      [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
      
      This patch (of 3):
      
      Remove page table allocator "quicklists".  These have been around for a
      long time, but have not got much traction in the last decade and are only
      used on ia64 and sh architectures.
      
      The numbers in the initial commit look interesting but probably don't
      apply anymore.  If anybody wants to resurrect this it's in the git
      history, but it's unhelpful to have this code and divergent allocator
      behaviour for minor archs.
      
      Also it might be better to instead make more general improvements to page
      allocator if this is still so slow.
      
      Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13224794
• mm: release the spinlock on zap_pte_range · 7b167b68
      Minchan Kim authored
      
      
In our testing (camera recording), Miguel and Wei found that
unmap_page_range() easily takes more than 6ms with preemption disabled.
The reason is that it holds the page table spinlock for the entire
512-page operation in a PMD.  6.2ms is never trivial for user experience
if an RT task cannot run in that time, since it can cause frame drops or
audio glitches.

I took the time to benchmark it by adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range.  The testing
device is a 2018 premium mobile device.

I can get a 2ms delay rather easily when releasing 2M (i.e., 512 pages)
while the task runs on a little core, even without any IPI or LRU lock
contention.  That is already too heavy.

If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
so most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed.  Thus, if there is LRU lock contention, that 0.7ms
could accumulate up to several ms.
      
      So this patch adds a check for need_resched() in the loop, and a
      preemption point if necessary.
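
A self-contained sketch of the batching pattern described; the locking and
need_resched() are stubbed out, so this mirrors only the shape of the change, not
the actual mm/memory.c code:

#include <stdio.h>
#include <stdbool.h>

/* Stubs standing in for the page-table lock and the scheduler hint. */
static void pte_lock(void)     { }
static void pte_unlock(void)   { }
static void cond_resched(void) { printf("  ...yielding to the scheduler\n"); }
static bool need_resched(void) { static int calls; return ++calls % 200 == 0; }

static void zap_range(unsigned long start, unsigned long end)
{
        unsigned long addr = start;

again:
        pte_lock();
        for (; addr < end; addr++) {
                /* zap one pte here ... */
                if (need_resched())
                        break;          /* don't hold the lock across a resched */
        }
        pte_unlock();

        if (addr < end) {               /* stopped early: breathe, then continue */
                cond_resched();
                goto again;
        }
}

int main(void)
{
        zap_range(0, 512);              /* one PMD's worth of ptes */
        printf("done\n");
        return 0;
}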
      
      Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Miguel de Dios <migueldedios@google.com>
Reported-by: Wei Wang <wvw@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b167b68
• mm: remove redundant assignment of entry · 9da99f20
      Wei Yang authored
      
      
      Since ptent will not be changed after previous assignment of entry, it is
      not necessary to do the assignment again.
      
      Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9da99f20
• net/xdp: convert put_page() to put_user_page*() · 1edc9769
      John Hubbard authored
      For pages that were retained via get_user_pages*(), release those pages
      via the new put_user_page*() routines, instead of via put_page() or
      release_pages().
      
This is part of a tree-wide conversion, as described in commit fc1d8e7c
("mm: introduce put_user_page*(), placeholder versions").
      
      Link: http://lkml.kernel.org/r/20190724044537.10458-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1edc9769
• drivers/gpu/drm/via: convert put_page() to put_user_page*() · 6f553ce4
      John Hubbard authored
      For pages that were retained via get_user_pages*(), release those pages
      via the new put_user_page*() routines, instead of via put_page() or
      release_pages().
      
This is part of a tree-wide conversion, as described in commit fc1d8e7c
("mm: introduce put_user_page*(), placeholder versions").
      
      Also reverse the order of a comparison, in order to placate checkpatch.pl.
      
      Link: http://lkml.kernel.org/r/20190724044537.10458-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f553ce4
• mm/gup: add make_dirty arg to put_user_pages_dirty_lock() · 2d15eb31
      Andrew Morton authored
From: John Hubbard <jhubbard@nvidia.com>
      Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
      
      Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
      v3.
      
      There are about 50+ patches in my tree [2], and I'll be sending out the
      remaining ones in a few more groups:
      
      * The block/bio related changes (Jerome mostly wrote those, but I've had
        to move stuff around extensively, and add a little code)
      
      * mm/ changes
      
      * other subsystem patches
      
      * an RFC that shows the current state of the tracking patch set.  That
        can only be applied after all call sites are converted, but it's good to
        get an early look at it.
      
This is part of a tree-wide conversion, as described in commit fc1d8e7c
("mm: introduce put_user_page*(), placeholder versions").
      
      This patch (of 3):
      
Provide a more capable variation of put_user_pages_dirty_lock(), and
delete put_user_pages_dirty().  This is based on the following:
      
      1.  Lots of call sites become simpler if a bool is passed into
         put_user_page*(), instead of making the call site choose which
         put_user_page*() variant to call.
      
      2.  Christoph Hellwig's observation that set_page_dirty_lock() is
         usually correct, and set_page_dirty() is usually a bug, or at least
         questionable, within a put_user_page*() calling chain.
      
      This leads to the following API choices:
      
          * put_user_pages_dirty_lock(page, npages, make_dirty)
      
          * There is no put_user_pages_dirty(). You have to
            hand code that, in the rare case that it's
            required.
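
A compilable sketch of the call-site simplification described above; the struct
and the put_user_page*() bodies are stand-in stubs so the example runs standalone,
and only the argument shapes follow the description:

#include <stdio.h>
#include <stdbool.h>

struct page { int dummy; };

/* Stand-in stubs mirroring the argument shapes described above. */
static void put_user_pages(struct page **pages, unsigned long npages)
{
        (void)pages;
        printf("released %lu pages\n", npages);
}

static void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
                                      bool make_dirty)
{
        if (!make_dirty) {
                put_user_pages(pages, npages);
                return;
        }
        printf("set_page_dirty_lock() on %lu pages, then released them\n", npages);
}

int main(void)
{
        struct page dummy[4], *pages[4] = { dummy, dummy + 1, dummy + 2, dummy + 3 };
        bool dirty = true;

        /* Before: the call site had to pick a variant itself, e.g.
         *   if (dirty) put_user_pages_dirty(pages, 4); else put_user_pages(pages, 4);
         * After: pass the bool and let the helper decide. */
        put_user_pages_dirty_lock(pages, 4, dirty);
        return 0;
}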
      
      [jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
        Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
      Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d15eb31
• mm: vmscan: do not share cgroup iteration between reclaimers · 1ba6fc9a
      Johannes Weiner authored
      
      
      One of our services observed a high rate of cgroup OOM kills in the
      presence of large amounts of clean cache.  Debugging showed that the
      culprit is the shared cgroup iteration in page reclaim.
      
      Under high allocation concurrency, multiple threads enter reclaim at the
      same time.  Fearing overreclaim when we first switched from the single
      global LRU to cgrouped LRU lists, we introduced a shared iteration state
      for reclaim invocations - whether 1 or 20 reclaimers are active
      concurrently, we only walk the cgroup tree once: the 1st reclaimer
      reclaims the first cgroup, the second the second one etc.  With more
      reclaimers than cgroups, we start another walk from the top.
      
      This sounded reasonable at the time, but the problem is that reclaim
      concurrency doesn't scale with allocation concurrency.  As reclaim
      concurrency increases, the amount of memory individual reclaimers get to
      scan gets smaller and smaller.  Individual reclaimers may only see one
      cgroup per cycle, and that may not have much reclaimable memory.  We see
      individual reclaimers declare OOM when there is plenty of reclaimable
      memory available in cgroups they didn't visit.
      
      This patch does away with the shared iterator, and every reclaimer is
      allowed to scan the full cgroup tree and see all of reclaimable memory,
      just like it would on a non-cgrouped system.  This way, when OOM is
      declared, we know that the reclaimer actually had a chance.
      
      To still maintain fairness in reclaim pressure, disallow cgroup reclaim
      from bailing out of the tree walk early.  Kswapd and regular direct
      reclaim already don't bail, so it's not clear why limit reclaim would have
      to, especially since it only walks subtrees to begin with.
      
      This change completely eliminates the OOM kills on our service, while
      showing no signs of overreclaim - no increased scan rates, %sys time, or
      abrupt free memory spikes.  I tested across 100 machines that have 64G of
      RAM and host about 300 cgroups each.
      
      [ It's possible overreclaim never was a *practical* issue to begin
        with - it was simply a concern we had on the mailing lists at the
        time, with no real data to back it up. But we have also added more
        bail-out conditions deeper inside reclaim (e.g. the proportional
        exit in shrink_node_memcg) since. Regardless, now we have data that
        suggests full walks are more reliable and scale just fine. ]
      
      Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ba6fc9a
• mm: memcontrol: switch to rcu protection in drain_all_stock() · e1a366be
      Roman Gushchin authored
Commit 72f0184c ("mm, memcg: remove hotplug locking from try_charge")
introduced css_tryget()/css_put() calls in drain_all_stock(), which are
supposed to protect the target memory cgroup from being released during
the mem_cgroup_is_descendant() call.
      
      However, it's not completely safe.  In theory, memcg can go away between
      reading stock->cached pointer and calling css_tryget().
      
      This can happen if drain_all_stock() races with drain_local_stock()
      performed on the remote cpu as a result of a work, scheduled by the
      previous invocation of drain_all_stock().
      
The race is a bit theoretical and there are few chances to trigger it, but
the current code looks a bit confusing, so it makes sense to fix it anyway.
The code looks as if css_tryget() and css_put() are used to protect the
draining of stocks.  That is not necessary, because stocked pages hold
references to the cached cgroup.  And it obviously won't work for work
items scheduled on other cpus.
      
      So, let's read the stock->cached pointer and evaluate the memory cgroup
      inside a rcu read section, and get rid of css_tryget()/css_put() calls.
      
      Link: http://lkml.kernel.org/r/20190802192241.3253165-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1a366be
• mm, memcg: throttle allocators when failing reclaim over memory.high · 0e4b01df
      Chris Down authored
      
      
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
This both creates a mechanism for system administrators or applications
consuming the PSI interface to trivially see that the memcg in question is
struggling and use that to make more reasonable decisions, and permits
them enough time to act.  Either of these can act with significantly more
nuance than we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
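
As a toy model only - the real formula and clamp constant live in
mem_cgroup_handle_over_high() and are not reproduced here - the described shape is
a penalty that is negligible near memory.high, grows steeply with overage, and is
capped per allocation batch:

#include <stdio.h>

#define HZ                      250          /* assumed tick rate for the demo */
#define MAX_HIGH_DELAY_JIFFIES  (2 * HZ)     /* the 2*HZ cap mentioned above */

/* Toy curve: ~zero penalty near memory.high, rising sharply with overage. */
static unsigned long penalty_jiffies(unsigned long usage, unsigned long high)
{
        unsigned long over, p;

        if (usage <= high)
                return 0;
        over = usage - high;                 /* pages over the high threshold */
        p = over * over / 65536;             /* arbitrary scaling for the demo */
        return p < MAX_HIGH_DELAY_JIFFIES ? p : MAX_HIGH_DELAY_JIFFIES;
}

int main(void)
{
        unsigned long high = 25600;          /* 100M in 4K pages */

        for (unsigned long mb = 0; mb <= 25; mb += 5)
                printf("%2lu MB over high -> sleep %lu jiffies\n",
                       mb, penalty_jiffies(high + mb * 256, high));
        return 0;
}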
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e4b01df
• mm: page cache: store only head pages in i_pages · 4101196b
      Matthew Wilcox (Oracle) authored
      
      
      Transparent Huge Pages are currently stored in i_pages as pointers to
      consecutive subpages.  This patch changes that to storing consecutive
      pointers to the head page in preparation for storing huge pages more
      efficiently in i_pages.
      
      Large parts of this are "inspired" by Kirill's patch
      https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/
      
      Kirill and Huang Ying contributed several fixes.
      
      [willy@infradead.org: use compound_nr, squish uninit-var warning]
      Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kirill Shutemov <kirill@shutemov.name>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Tested-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Tested-by: Qian Cai <cai@lca.pw>
      Tested-by: Mikhail Gavrilo...
      4101196b
• mm/filemap.c: rewrite mapping_needs_writeback in less fancy manner · 875d91b1
      Konstantin Khlebnikov authored
      
      
      This actually checks that writeback is needed or in progress.
      
      Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      875d91b1
• mm/filemap.c: don't initiate writeback if mapping has no dirty pages · c3aab9a0
      Konstantin Khlebnikov authored
      
      
Functions like filemap_write_and_wait_range() should do nothing if the
inode has no dirty pages or pages currently under writeback.  But they
construct a struct writeback_control anyway, and this does some atomic
operations if CONFIG_CGROUP_WRITEBACK=y - on the fast path it locks
inode->i_lock and updates the writeback ownership state; on the slow path
there might be more work.  Currently this path is safely avoided only when
the inode mapping has no pages.
      
      For example generic_file_read_iter() calls filemap_write_and_wait_range()
      at each O_DIRECT read - pretty hot path.
      
      This patch skips starting new writeback if mapping has no dirty tags set.
      If writeback is already in progress filemap_write_and_wait_range() will
      wait for it.
      
      Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3aab9a0
• mm, page_owner, debug_pagealloc: save and dump freeing stack trace · 8974558f
      Vlastimil Babka authored
      
      
      The debug_pagealloc functionality is useful to catch buggy page allocator
      users that cause e.g.  use after free or double free.  When page
inconsistency is detected, debugging is often simpler when the call stacks
of the process that last allocated and freed the page are known.  When
page_owner is also enabled, we record the allocation stack trace, but not
the freeing one.
      
      This patch therefore adds recording of freeing process stack trace to page
      owner info, if both page_owner and debug_pagealloc are configured and
      enabled.  With only page_owner enabled, this info is not useful for the
      memory leak debugging use case.  dump_page() is adjusted to print the
      info.  An example result of calling __free_pages() twice may look like
      this (note the page last free stack trace):
      
      BUG: Bad page state in process bash  pfn:13d8f8
      page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1affff800000000()
      raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
      page dumped because: nonzero _refcount
      page_owner tracks the page as freed
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
       prep_new_page+0x143/0x150
       get_page_from_freelist+0x289/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       khugepaged+0x6e/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      page last free stack trace:
       free_pcp_prepare+0x134/0x1e0
       free_unref_page+0x18/0x90
       khugepaged+0x7b/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      Modules linked in:
      CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
      Call Trace:
       dump_stack+0x85/0xc0
       bad_page.cold+0xba/0xbf
       rmqueue_pcplist.isra.0+0x6c5/0x6d0
       rmqueue+0x2d/0x810
       get_page_from_freelist+0x191/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       __get_free_pages+0xd/0x30
       __pud_alloc+0x2c/0x110
       copy_page_range+0x4f9/0x630
       dup_mmap+0x362/0x480
       dup_mm+0x68/0x110
       copy_process+0x19e1/0x1b40
       _do_fork+0x73/0x310
       __x64_sys_clone+0x75/0x80
       do_syscall_64+0x6e/0x1e0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f10af854a10
      ...
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8974558f
• mm, page_owner: keep owner info when freeing the page · 37389167
      Vlastimil Babka authored
      
      
      For debugging purposes it might be useful to keep the owner info even
      after page has been freed, and include it in e.g.  dump_page() when
      detecting a bad page state.  For that, change the PAGE_EXT_OWNER flag
      meaning to "page owner info has been set at least once" and add new
      PAGE_EXT_OWNER_ACTIVE for tracking whether page is supposed to be
      currently tracked allocated or free.  Adjust dump_page() accordingly,
      distinguishing free and allocated pages.  In the page_owner debugfs file,
      keep printing only allocated pages so that existing scripts are not
      confused, and also because free pages are irrelevant for the memory
      statistics or leak detection that's the typical use case of the file,
      anyway.
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37389167
• mm, page_owner: record page owner for each subpage · 7e2f2a0c
      Vlastimil Babka authored
      
      
      Patch series "debug_pagealloc improvements through page_owner", v2.
      
      The debug_pagealloc functionality serves a similar purpose on the page
      allocator level that slub_debug does on the kmalloc level, which is to
      detect bad users.  One notable feature that slub_debug has is storing
      stack traces of who last allocated and freed the object.  On page level we
      track allocations via page_owner, but that info is discarded when freeing,
      and we don't track freeing at all.  This series improves those aspects.
      With both debug_pagealloc and page_owner enabled, we can then get bug
      reports such as the example in Patch 4.
      
      SLUB debug tracking additionally stores cpu, pid and timestamp.  This could
      be added later, if deemed useful enough to justify the additional page_ext
      structure size.
      
      This patch (of 3):
      
      Currently, page owner info is only recorded for the first page of a
      high-order allocation, and copied to tail pages in the event of a split
      page.  With the plan to keep previous owner info after freeing the page,
it would be beneficial to record the page owner for each subpage upon
      allocation.  This increases the overhead for high orders, but that should
      be acceptable for a debugging option.
      
      The order stored for each subpage is the order of the whole allocation.
      This makes it possible to calculate the "head" pfn and to recognize "tail"
      pages (quoted because not all high-order allocations are compound pages
      with true head and tail pages).  When reading the page_owner debugfs file,
      keep skipping the "tail" pages so that stats gathered by existing scripts
      don't get inflated.
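
A quick userspace illustration of the "head" pfn calculation this enables once
every subpage records the order of the whole allocation; the pfn and order values
are made up for the demo:

#include <stdio.h>

int main(void)
{
        unsigned long pfn  = 0x13d8fd;       /* some "tail" pfn inside the allocation */
        unsigned int order = 3;              /* order recorded for every subpage */

        /* Mask off the low bits: the head pfn of a 2^order-page allocation. */
        unsigned long head_pfn = pfn & ~((1UL << order) - 1);

        printf("pfn %#lx, order %u -> head pfn %#lx, is tail: %d\n",
               pfn, order, head_pfn, pfn != head_pfn);
        return 0;
}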
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e2f2a0c
• mm: replace list_move_tail() with add_page_to_lru_list_tail() · e7a1aaf2
      Yu Zhao authored
      
      
      This is a cleanup patch that replaces two historical uses of
      list_move_tail() with relatively recent add_page_to_lru_list_tail().
      
      Link: http://lkml.kernel.org/r/20190716212436.7137-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7a1aaf2
• mm: introduce compound_nr() · d8c6546b
      Matthew Wilcox (Oracle) authored
      
      
      Replace 1 << compound_order(page) with compound_nr(page).  Minor
      improvements in readability.
      
      Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8c6546b