  1. Oct 08, 2016
    • mm/hugetlb: check for reserved hugepages during memory offline · 082d5b6b
      Gerald Schaefer authored
      In dissolve_free_huge_pages(), free hugepages will be dissolved without
      making sure that there are enough of them left to satisfy hugepage
      reservations.
      
      Fix this by adding a return value to dissolve_free_huge_pages() and
      checking h->free_huge_pages vs.  h->resv_huge_pages.  Note that this may
      lead to the situation where dissolve_free_huge_page() returns an error
      and all free hugepages that were dissolved before that error are lost,
      while the memory block still cannot be set offline.
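
      Roughly, the resulting check looks like the sketch below (abridged for
      illustration; not the literal diff):

        static int dissolve_free_huge_page(struct page *page)
        {
                int rc = 0;

                spin_lock(&hugetlb_lock);
                if (PageHuge(page) && !page_count(page)) {
                        struct hstate *h = page_hstate(page);

                        /* don't dissolve below the hugepage reserve */
                        if (h->free_huge_pages - h->resv_huge_pages == 0) {
                                rc = -EBUSY;
                                goto out;
                        }
                        /* ... unlink the page from its free list and free it ... */
                }
        out:
                spin_unlock(&hugetlb_lock);
                return rc;
        }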
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      082d5b6b
    • mm/hugetlb: fix memory offline with hugepage size > memory block size · 2247bb33
      Gerald Schaefer authored
      Patch series "mm/hugetlb: memory offline issues with hugepages", v4.
      
      This addresses several issues with hugepages and memory offline.  While
      the first patch fixes a panic, and is therefore rather important, the
      last patch is just a performance optimization.
      
      The second patch fixes a theoretical issue with reserved hugepages,
      while still leaving an ugly usability issue; see its description.
      
      This patch (of 3):
      
      dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
      list corruption and addressing exception when trying to set a memory
      block offline that is part (but not the first part) of a "gigantic"
      hugetlb page with a size > memory block size.
      
      When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
      will trigger directly.  In the other case we will run into an addressing
      exception later, because dissolve_free_huge_page() will not work on the
      head page of the compound hugetlb page which will result in a NULL
      hstate from page_hstate().
      
      To fix this, first remove the VM_BUG_ON() because it is wrong, and then
      use the compound head page in dissolve_free_huge_page().  This means
      that an unused pre-allocated gigantic page that has any part of itself
      inside the memory block that is going offline will be dissolved
      completely.  Losing an unused gigantic hugepage is preferable to failing
      the memory offline, for example in the situation where a (possibly
      faulty) memory DIMM needs to go offline.
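
      In other words, the fix resolves the compound head before looking up the
      hstate; a hedged sketch of the idea (abridged):

        if (PageHuge(page) && !page_count(page)) {
                struct page *head = compound_head(page);
                struct hstate *h = page_hstate(head);
                int nid = page_to_nid(head);

                list_del(&head->lru);
                h->free_huge_pages--;
                h->free_huge_pages_node[nid]--;
                update_and_free_page(h, head);
        }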
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2247bb33
    • mm: nobootmem: move the comment of free_all_bootmem · 914a0516
      Wanlong Gao authored
      Commit b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
      removed the unnecessary nodeid argument; since then, this comment has
      been confusing.  We should move it to the right place.
      
      Fixes: b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
      Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com
      Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      914a0516
    • mm/shmem.c: constify anon_ops · 19938e35
      Rasmus Villemoes authored
      
      
      Every other dentry_operations instance is const, and this one might as
      well be.
      
      Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19938e35
    • mm: memcontrol: consolidate cgroup socket tracking · 2d758073
      Johannes Weiner authored
      
      
      The cgroup core and the memory controller need to track socket ownership
      for different purposes, but the tracking sites being entirely different
      is kind of ugly.
      
      Be a better citizen and rename the memory controller callbacks to match
      the cgroup core callbacks, then move them to the same place.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d758073
    • mm: move phys_mem_access_prot_allowed() declaration to pgtable.h · 08ea8c07
      Baoyou Xie authored
      
      
      We get 1 warning when building kernel with W=1:
      
        drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
         int __weak phys_mem_access_prot_allowed(struct file *file,
      
      In fact, its declaration is spread across several architecture-specific
      header files, but it needs to be declared in a common header file.
      
      So this patch moves phys_mem_access_prot_allowed() to pgtable.h.
      
      Link: http://lkml.kernel.org/r/1473751597-12139-1-git-send-email-baoyou.xie@linaro.org
      Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08ea8c07
    • mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE() · cc30c5d6
      Andrew Morton authored
      
      
      So they are CONFIG_DEBUG_VM-only and more informative.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc30c5d6
    • mm: don't emit warning from pagefault_out_of_memory() · a104808e
      Tetsuo Handa authored
      Commit c32b3cbe ("oom, PM: make OOM detection in the freezer path
      raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
      to warn when we raced with disabling the OOM killer.
      
      Now, patch "oom, suspend: fix oom_killer_disable vs.  pm suspend
      properly" introduced a timeout for oom_killer_disable().  Even if we
      raced with disabling the OOM killer and the system is OOM livelocked,
      the OOM killer will be enabled eventually (in 20 seconds by default) and
      the OOM livelock will be solved.  Therefore, we no longer need to warn
      when we raced with disabling the OOM killer.
      
      Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a104808e
    • mm, compaction: restrict fragindex to costly orders · 20311420
      Vlastimil Babka authored
      
      
      Fragmentation index and the vm.extfrag_threshold sysctl is meant as a
      heuristic to prevent excessive compaction for costly orders (i.e.  THP).
      It's unlikely to make any difference for non-costly orders, especially
      with the default threshold.  But we cannot afford any uncertainty for
      the non-costly orders where the only alternative to successful
      reclaim/compaction is OOM.  After the recent patches we are guaranteed
      maximum effort without heuristics from compaction before deciding OOM,
      and fragindex is the last remaining heuristic.  Therefore skip fragindex
      altogether for non-costly orders.
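
      The resulting guard in the suitability check is roughly (a sketch; exact
      placement in compaction_suitable() may differ):

        /* only consult fragindex for costly orders, i.e. THP-sized requests */
        if (order > PAGE_ALLOC_COSTLY_ORDER) {
                int fragindex = fragmentation_index(zone, order);

                if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                        return COMPACT_NOT_SUITABLE_ZONE;
        }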
      
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20311420
    • mm, compaction: ignore fragindex from compaction_zonelist_suitable() · cc5c9f09
      Vlastimil Babka authored
      
      
      The compaction_zonelist_suitable() function tries to determine if
      compaction will be able to proceed after sufficient reclaim, i.e.
      whether there are enough reclaimable pages to provide enough order-0
      freepages for compaction.
      
      This addition of reclaimable pages to the free pages works well for the
      order-0 watermark check, but in the fragmentation index check we only
      consider truly free pages.  Thus we can get a fragindex value close to
      0, which indicates failure due to lack of memory, and wrongly decide
      that compaction won't be suitable even after reclaim.
      
      Instead of trying to somehow adjust fragindex for reclaimable pages,
      let's just skip it from compaction_zonelist_suitable().
      
      Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc5c9f09
    • mm, page_alloc: pull no_progress_loops update to should_reclaim_retry() · 423b452e
      Vlastimil Babka authored
      
      
      The should_reclaim_retry() makes decisions based on no_progress_loops,
      so it makes sense to also update the counter there.  It will be also
      consistent with should_compact_retry() and compaction_retries.  No
      functional change.
      
      [hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
      Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      423b452e
    • mm, compaction: make full priority ignore pageblock suitability · 9f7e3387
      Vlastimil Babka authored
      
      
      Several people have reported premature OOMs for order-2 allocations
      (stack) due to OOM rework in 4.7.  In the scenario (parallel kernel
      build and dd writing to two drives) many pageblocks get marked as
      Unmovable and compaction free scanner struggles to isolate free pages.
      Joonsoo Kim pointed out that the free scanner skips pageblocks that are
      not movable to prevent filling them and forcing non-movable allocations
      to fall back to other pageblocks.  Such a heuristic makes sense to help
      prevent long-term fragmentation, but premature OOMs are a relatively more
      urgent problem.  As a compromise, this patch disables the heuristic only
      for the ultimate compaction priority.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
      Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
      Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
      Reported-by: Olaf Hering <olaf@aepfle.de>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f7e3387
    • mm, compaction: restrict full priority to non-costly orders · c2033b00
      Vlastimil Babka authored
      
      
      The new ultimate compaction priority disables some heuristics, which may
      result in excessive cost.  This is fine for non-costly orders where we
      want to try hard before resorting to OOM, but might be disruptive for
      costly orders which do not trigger OOM and should generally have some
      fallback.  Thus, we disable the full priority for costly orders.
      
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2033b00
    • mm, compaction: more reliably increase direct compaction priority · d9436498
      Vlastimil Babka authored
      
      
      During reclaim/compaction loop, compaction priority can be increased by
      the should_compact_retry() function, but the current code is not
      optimal.  Priority is only increased when compaction_failed() is true,
      which means that compaction has scanned the whole zone.  This may not
      happen even after multiple attempts with a lower priority due to
      parallel activity, so we might needlessly struggle on the lower
      priorities and possibly run out of compaction retry attempts in the
      process.
      
      After this patch we are guaranteed at least one attempt at the highest
      compaction priority even if we exhaust all retries at the lower
      priorities.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9436498
    • Revert "mm, oom: prevent premature OOM killer invocation for high order request" · 3250845d
      Vlastimil Babka authored
      Patch series "reintroduce compaction feedback for OOM decisions".
      
      After several people reported OOM's for order-2 allocations in 4.7 due
      to Michal Hocko's OOM rework, he reverted the part that considered
      compaction feedback [1] in the decisions to retry reclaim/compaction.
      This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
      while mmotm had an almost complete solution that instead improved
      compaction reliability.
      
      This series completes the mmotm solution and reintroduces the compaction
      feedback into OOM decisions.  The first two patches restore the state of
      mmotm before the temporary solution was merged, the last patch should be
      the missing piece for reliability.  The third patch restricts the
      hardened compaction to non-costly orders, since costly orders don't
      result in OOMs in the first place.
      
      [1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E
      
      This patch (of 4):
      
      Commit 6b4e3181 ("mm, oom: prevent premature OOM killer invocation
      for high order request") was intended as a quick fix of OOM regressions
      for 4.8 and stable 4.7.x kernels.  For a better long-term solution, we
      still want to consider compaction feedback, which should be possible
      after some more improvements in the following patches.
      
      This reverts commit 6b4e3181.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3250845d
    • mm: remove page_file_index · 8cd79788
      Huang Ying authored
      
      
      After using the offset of the swap entry as the key of the swap cache,
      the page_index() becomes exactly the same as page_file_index().  So the
      page_file_index() is removed and the callers are changed to use
      page_index() instead.
      
      Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd79788
    • mm, swap: use offset of swap entry as key of swap cache · f6ab1f7f
      Huang Ying authored
      
      
      This patch is to improve the performance of swap cache operations when
      the type of the swap device is not 0.  Originally, the whole swap entry
      value is used as the key of the swap cache, even though there is one
      radix tree for each swap device.  If the type of the swap device is not
      0, the height of the radix tree of the swap cache will be increased
      unnecessarily, especially on 64bit architectures.  For example, for a 1GB
      swap device on the x86_64 architecture, the height of the radix tree of
      the swap cache is 11.  But if the offset of the swap entry is used as
      the key of the swap cache, the height of the radix tree of the swap
      cache is 4.  The increased height causes unnecessary radix tree
      descending and increased cache footprint.
      
      This patch reduces the height of the radix tree of the swap cache via
      using the offset of the swap entry instead of the whole swap entry value
      as the key of the swap cache.  In 32 processes sequential swap out test
      case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
      for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
      when the type of the swap device is 1.
      
      Use the whole swap entry as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
      
      Use the swap offset as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
      
      Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6ab1f7f
    • mm: fix cache mode tracking in vm_insert_mixed() · 87744ab3
      Dan Williams authored
      
      
      vm_insert_mixed() unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
      fails to check the pgprot_t it uses for the mapping against the one
      recorded in the memtype tracking tree.  Add the missing call to
      track_pfn_insert() to preclude cases where incompatible aliased mappings
      are established for a given physical address range.
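
      A hedged sketch of the added call, mirroring what vm_insert_pfn_prot()
      already does (body abridged; the surrounding code may differ):

        int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
                        pfn_t pfn)
        {
                pgprot_t pgprot = vma->vm_page_prot;

                /* record/validate the cache mode for this range */
                track_pfn_insert(vma, &pgprot, pfn);
                /* ... then insert the page or pfn using pgprot ... */
        }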
      
      Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87744ab3
    • memory-hotplug: fix store_mem_state() return value · d66ba15b
      Reza Arbab authored
      
      
      If store_mem_state() is called to online memory which is already online,
      it will return 1, the value it got from device_online().
      
      This is wrong because store_mem_state() is a device_attribute .store
      function.  Thus a non-negative return value represents input bytes read.
      
      Set the return value to -EINVAL in this case.
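
      A sketch of the intended mapping of return values in the .store handler
      (simplified; not the literal diff):

        ret = device_online(dev);       /* may return 1 if already online */
        if (ret < 0)
                return ret;
        if (ret)
                return -EINVAL;         /* a bare positive value would be read as bytes consumed */
        return count;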
      
      Link: http://lkml.kernel.org/r/1472743777-24266-1-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d66ba15b
    • mm/memcontrol.c: make the walk_page_range() limit obvious · 0247f3f4
      James Morse authored
      
      
      mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
      walk_page_range() on the range 0 to ~0UL; neither provides a pte_hole
      callback, which causes the current implementation to skip non-vma
      regions.  This is all fine, but follow-up changes would like to make
      walk_page_range() more generic, so it is better to be explicit about
      which range to traverse; let's use highest_vm_end to explicitly traverse
      only user mmapped memory.
      
      [mhocko@kernel.org: rewrote changelog]
      Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0247f3f4
    • thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      
      
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only needs to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
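
      A hedged sketch of the per-mm gating described above (identifier names
      follow this changelog; the exact names in the tree may differ):

        struct page *mm_get_huge_zero_page(struct mm_struct *mm)
        {
                /* fast path: this mm already holds a reference on the counter */
                if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        return READ_ONCE(huge_zero_page);

                if (!get_huge_zero_page())      /* takes a counter reference */
                        return NULL;

                /* lost a race with another thread of this mm: drop the extra ref */
                if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        put_huge_zero_page();

                return READ_ONCE(huge_zero_page);
        }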
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious · 0f30206b
      James Morse authored
      
      
      Trying to walk all of virtual memory requires architecture specific
      knowledge.  On x86_64, addresses must be sign extended from bit 48,
      whereas on arm64 the top VA_BITS of address space have their own set of
      page tables.
      
      clear_refs_write() calls walk_page_range() on the range 0 to ~0UL; it
      provides a test_walk() callback that only expects to be walking over
      VMAs.  Currently walk_pmd_range() will skip memory regions that don't
      have a VMA, reporting them as a hole.
      
      As this call only expects to walk user address space, make it walk 0 to
      'highest_vm_end'.
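
      The resulting call is roughly (a sketch; the walk-control structure name
      is illustrative):

        down_read(&mm->mmap_sem);
        /* walk only user mappings instead of the whole virtual address space */
        walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
        up_read(&mm->mmap_sem);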
      
      Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f30206b
    • cpu: fix node state for whether it contains CPU · 03e86dba
      Tim Chen authored
      
      
      In current kernel code, we only call node_set_state(cpu_to_node(cpu),
      N_CPU) when a cpu is hot plugged.  But we do not set the node state for
      N_CPU when the cpus are brought online during boot.
      
      So this could lead to failure when we check to see if a node contains a
      cpu with node_state(node_id, N_CPU).
      
      One use case is in the node_reclaim() function:
      
        /*
         * Only run node reclaim on the local node or on nodes that do not
         * have associated processors. This will favor the local processor
         * over remote processors and spread off node memory allocations
         * as wide as possible.
         */
        if (node_state(pgdat->node_id, N_CPU) &&
            pgdat->node_id != numa_node_id())
                return NODE_RECLAIM_NOSCAN;
      
      I instrumented the kernel to call this function after boot and it always
      returns 0 on an x86 desktop machine until I apply the attached patch.
      
         int num_cpu_node(void)
         {
             int i, nr_cpu_nodes = 0;
      
             for_each_node(i) {
                     if (node_state(i, N_CPU))
                             ++ nr_cpu_nodes;
             }
      
             return nr_cpu_nodes;
         }
      
      Fix this by checking each node for online CPUs when we initialize
      vmstat, which is responsible for maintaining node state.
      
      Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: <Huang@linux.intel.com>
      Cc: Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e86dba
    • ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings · dbe6ec81
      Toshi Kani authored
      
      
      To support DAX pmd mappings with unmodified applications, filesystems
      need to align an mmap address by the pmd size.
      
      Call thp_get_unmapped_area() from f_op->get_unmapped_area.
      
      Note, there is no change in behavior for a non-DAX file.
      
      Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbe6ec81
    • thp, dax: add thp_get_unmapped_area for pmd mappings · 74d2fad1
      Toshi Kani authored
      
      
      When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
      This feature relies on both mmap virtual address and FS block (i.e.
      physical address) to be aligned by the pmd page size.  Users can use
      mkfs options to specify FS to align block allocations.  However,
      aligning mmap address requires code changes to existing applications for
      providing a pmd-aligned address to mmap().
      
      For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
      It calls mmap() with a NULL address, which needs to be changed to
      provide a pmd-aligned address for testing with DAX pmd mappings.
      Changing all applications that call mmap() with NULL is undesirable.
      
      Add thp_get_unmapped_area(), which can be called by filesystem's
      get_unmapped_area to align an mmap address by the pmd size for a DAX
      file.  It calls the default handler, mm->get_unmapped_area(), to find a
      range and then aligns it for a DAX file.
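
      A simplified sketch of the alignment step (constants and error handling
      abridged; the actual helper may differ in detail):

        /* ask for pmd_size worth of slack, then slide the start so that the
           virtual address and the file offset share pmd alignment */
        len_pad = len + HPAGE_PMD_SIZE;
        addr = current->mm->get_unmapped_area(filp, 0, len_pad,
                                              off >> PAGE_SHIFT, flags);
        if (!IS_ERR_VALUE(addr))
                addr += (off - addr) & (HPAGE_PMD_SIZE - 1);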
      
      The patch is based on Matthew Wilcox's change that allows adding support
      of the pud page size easily.
      
      [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
      Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74d2fad1
    • selftests: expanding more mlock selftest · 26b4224d
      Simon Guo authored
      
      
      This patch will randomly perform mlock/mlock2 on a given memory region,
      and verify that the RLIMIT_MEMLOCK limitation works properly.
      
      Suggested-by: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/1473325970-11393-4-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26b4224d
    • selftest: move seek_to_smaps_entry() out of mlock2-tests.c · d5aed9c0
      Simon Guo authored
      
      
      Function seek_to_smaps_entry() can be useful for other selftest
      functionalities, so move it out to a header file.
      
      Link: http://lkml.kernel.org/r/1473325970-11393-3-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5aed9c0
    • selftests/vm: add test for mlock() when areas are intersected · 1448d4d8
      Simon Guo authored
      
      
      This patch adds an mlock() test for multiple invocations on the same
      address area, and verifies that this doesn't mess up the rlimit mlock
      limitation.
      
      Link: http://lkml.kernel.org/r/1472554781-9835-5-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1448d4d8
    • selftest: split mlock2_ funcs into separate mlock2.h · c7f032bb
      Simon Guo authored
      
      
      To prepare mlock2.h whose functionality will be reused.
      
      Link: http://lkml.kernel.org/r/1472554781-9835-4-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7f032bb
    • mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT) · b155b4fd
      Simon Guo authored
      
      
      When a vma has the VM_LOCKED|VM_LOCKONFAULT flags set (by invoking
      mlock2(,MLOCK_ONFAULT)), it can be populated again with mlock(), which
      sets the VM_LOCKED flag only.
      
      There is a hole in mlock_fixup() which increases mm->locked_vm twice
      even though the two operations are on the same vma and both have the
      VM_LOCKED flag set.
      
      The issue can be reproduced with the following code:
      
        mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
        mlock(p, 1024 * 64);  //VM_LOCKED
      
      Then check the VmLck field in /proc/pid/status: it has grown to 128k.
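
      For reference, a self-contained userspace reproducer along those lines
      might look like this (a sketch; it assumes the mlock2 syscall number is
      available via <sys/syscall.h>, i.e. Linux >= 4.4, and reads VmLck from
      /proc/self/status):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MLOCK_ONFAULT
        #define MLOCK_ONFAULT 0x01
        #endif

        static void print_vmlck(const char *when)
        {
                char line[128];
                FILE *f = fopen("/proc/self/status", "r");

                while (f && fgets(line, sizeof(line), f))
                        if (!strncmp(line, "VmLck:", 6))
                                printf("%s: %s", when, line);
                if (f)
                        fclose(f);
        }

        int main(void)
        {
                size_t len = 64 * 1024;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                syscall(SYS_mlock2, p, len, MLOCK_ONFAULT); /* VM_LOCKED|VM_LOCKONFAULT */
                print_vmlck("after mlock2");
                mlock(p, len);                              /* VM_LOCKED only */
                print_vmlck("after mlock");                 /* buggy kernels report 128 kB */
                return 0;
        }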
      
      When a vma's vm_flags are changed and the new vm_flags include
      VM_LOCKED, it is not necessarily a "new locked" vma.  This patch
      corrects the bug by preventing mm->locked_vm from being incremented when
      the old vm_flags already include VM_LOCKED.
      
      Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b155b4fd
    • mm: mlock: check against vma for actual mlock() size · 0cf2f6f6
      Simon Guo authored
      
      
      In do_mlock(), the check against the locked memory limitation has a hole
      which will fail the following case at step 3):
      
       1) User has a memory chunk from addressA with 50k, and user mem lock
          rlimit is 64k.
       2) mlock(addressA, 30k)
       3) mlock(addressA, 40k)
      
      The 3rd step should have been allowed since the 40k request intersects
      the previous 30k locked at step 2), so the 3rd step actually only needs
      to mlock the extra 10k of memory.
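
      A short userspace fragment demonstrating the scenario (a sketch under the
      stated 64k RLIMIT_MEMLOCK assumption; needs <sys/mman.h> and
      <sys/resource.h>):

        struct rlimit rl = { .rlim_cur = 64 * 1024, .rlim_max = 64 * 1024 };
        void *a;

        setrlimit(RLIMIT_MEMLOCK, &rl);
        a = mmap(NULL, 50 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        mlock(a, 30 * 1024);    /* ok: ~30k locked, limit is 64k */
        mlock(a, 40 * 1024);    /* only ~10k is new, but unpatched kernels
                                   charge the full 40k and fail with ENOMEM */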
      
      This patch checks the vma to calculate the actual "new" mlock size, if
      necessary, and adjusts the logic to fix this issue.
      
      [akpm@linux-foundation.org: clean up comment layout]
      [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
       Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
      Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Simon Guo <wei.guo.simon@gmail.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cf2f6f6
    • oom: warn if we go OOM for higher order and compaction is disabled · 9254990f
      Michal Hocko authored
      
      
      Since lumpy reclaim is gone, there is no source of higher order pages
      if CONFIG_COMPACTION=n except for order-0 page reclaim, which is
      unreliable for that purpose to say the least.  Hitting an OOM for
      !costly higher order requests is therefore not at all hard to imagine.
      We are trying hard to not invoke OOM killer as much as possible but
      there is simply no reliable way to detect whether more reclaim retries
      make sense.
      
      Disabling COMPACTION is not widespread, but it seems that some users
      might have disabled the feature without realizing the full consequences
      (mostly along with disabling THP, because compaction used to be mainly a
      THP thing).  This patch just adds a note if the OOM killer was triggered
      by a higher order request with compaction disabled.  This will help us
      identify possible misconfiguration right from the oom report, which is
      easier than always keeping in mind that somebody might have disabled
      COMPACTION without a good reason.
      
      Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9254990f
    • mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying authored
      
      
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
      etc.) to accelerate finding the pages with a specific tag in the radix
      tree during inode writeback.  But for anonymous pages in the swap cache,
      there is no inode writeback.  So there is no need to find the pages with
      some writeback tags in the radix tree.  It is not necessary to touch
      radix tree writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
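
      A hedged sketch of the flag and its helpers (names as described here;
      details in the tree may differ):

        static inline void mapping_set_no_writeback_tags(struct address_space *m)
        {
                set_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
        }

        static inline int mapping_use_writeback_tags(struct address_space *m)
        {
                return !test_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
        }

      The writeback paths then only touch the radix tree tags when
      mapping_use_writeback_tags() returns true.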
      
      With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
      ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of comparison is as follow,
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      371a096e
    • mm/bootmem.c: replace kzalloc() by kzalloc_node() · 1d8bf926
      zijun_hu authored
      
      
      In ___alloc_bootmem_node_nopanic(), replace kzalloc() by kzalloc_node()
      in order to allocate memory preferentially on the given node when slab
      is available.
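
      In other words (a sketch; the gfp flags follow the surrounding code):

        if (slab_is_available())
                return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);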
      
      Link: http://lkml.kernel.org/r/1f487f12-6af4-5e4f-a28c-1de2361cdcd8@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d8bf926
    • mm/nobootmem.c: remove duplicate macro ARCH_LOW_ADDRESS_LIMIT statements · 2382705f
      zijun_hu authored
      
      
      Fix the following bugs:
      
       - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between
         header and relevant source
      
       - an ARCH_LOW_ADDRESS_LIMIT possibly defined by the architecture in
         asm/processor.h is not reliably preferred over the default in
         linux/bootmem.h, since the former header isn't included by the latter
      
      Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2382705f
    • powerpc: implement arch_reserved_kernel_pages · 1e76609c
      Srikar Dronamraju authored
      
      
      Currently a significant amount of memory is reserved only in a kernel
      booted to capture a kernel dump using the fa_dump method.
      
      Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
      only a certain amount of memory per node.  That amount takes into account
      the dentry and inode cache sizes.  Currently the cache sizes are
      calculated based on the total system memory, including the reserved
      memory.  However, when such a kernel is booted as the fadump kernel, it
      will not be able to allocate the required amount of memory for the
      dentry and inode caches.  This results in crashes like the one shown
      below.
      
      Hence only implement arch_reserved_kernel_pages() for CONFIG_FA_DUMP
      configurations.  The amount reserved will be reduced while calculating
      the large caches and will avoid crashes like the below on large systems
      such as 32 TB systems.
      
        Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
        vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
        swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
        Call Trace:
           dump_stack+0xb0/0xf0 (unreliable)
           warn_alloc_failed+0x114/0x160
           __vmalloc_node_range+0x304/0x340
           __vmalloc+0x6c/0x90
           alloc_large_system_hash+0x1b8/0x2c0
           inode_init+0x94/0xe4
           vfs_caches_init+0x8c/0x13c
           start_kernel+0x50c/0x578
           start_here_common+0x20/0xa8
      
      Link: http://lkml.kernel.org/r/1472476010-4709-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e76609c
    • mm/memblock.c: expose total reserved memory · 8907de5d
      Srikar Dronamraju authored
      
      
      The total reserved memory in a system is accounted but not available for
      use outside mm/memblock.c.  By exposing the total reserved memory,
      systems can better calculate the size of large hashes.
      
      Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8907de5d
    • mm: introduce arch_reserved_kernel_pages() · f6f34b43
      Srikar Dronamraju authored
      
      
      Currently arch specific code can reserve memory blocks but
      alloc_large_system_hash() may not take it into consideration when sizing
      the hashes.  This can lead to a bigger hash than required and leave no
      memory available for other purposes.  This is specifically true for
      systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
      
      One approach to solve this problem would be to walk through the memblock
      regions and calculate the available memory and base the size of hash
      system on the available memory.
      
      The other approach would be to depend on the architecture to provide the
      number of pages that are reserved.  This change provides hooks to allow
      the architecture to provide the required info.
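
      A hedged sketch of the hook shape (the default returns 0, architectures
      can override it, and the hash sizing subtracts the result):

        #ifndef CONFIG_HAVE_ARCH_RESERVED_KERNEL_PAGES
        static inline unsigned long arch_reserved_kernel_pages(void)
        {
                return 0;
        }
        #endif

        /* in alloc_large_system_hash(), when sizing from the amount of memory */
        numentries = nr_kernel_pages;
        numentries -= arch_reserved_kernel_pages();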
      
      Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6f34b43
    • mm: use zonelist name instead of using hardcoded index · c9634cf0
      Aneesh Kumar K.V authored
      
      
      Use the existing enums instead of hardcoded index when looking at the
      zonelist.  This makes it more readable.  No functionality change by this
      patch.
      
      Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9634cf0
    • oom, oom_reaper: allow to reap mm shared by the kthreads · 1b51e65e
      Michal Hocko authored
      
      
      oom reaper was skipped for an mm which is shared with the kernel thread
      (aka use_mm()).  The primary concern was that such a kthread might want
      to read from the userspace memory and see zero page as a result of the
      oom reaper action.  This is no longer a problem after "mm: make sure
      that kthreads will not refault oom reaped memory" because any attempt to
      fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the
      target user should see an error.  This means that we can finally allow
      oom reaper also to tasks which share their mm with kthreads.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b51e65e