Skip to content
  1. Nov 16, 2017
    • Tetsuo Handa's avatar
      mm: don't warn about allocations which stall for too long · 400e2249
      Tetsuo Handa authored
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step for reducing possibility of silent hang up
      problem caused by memory allocation stalls.  But this commit reverts it,
      for it is possible to trigger OOM lockup and/or soft lockups when many
      threads concurrently called warn_alloc() (in order to warn about memory
      allocation stalls) due to current implementation of printk(), and it is
      difficult to obtain useful information due to limitation of synchronous
      warning approach.
      
      Current printk() implementation flushes all pending logs using the
      context of a thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually unless somebody continues
      appending to printk() buffer.
      
      Since warn_alloc() started appending to printk() buffer while waiting
      for oom_kill_process() to make forward progress when oom_kill_process()
      is processing pending logs, it became possible for warn_alloc() to force
      oom_kill_process() loop inside printk().  As a result, warn_alloc()
      significantly increased possibility of preventing oom_kill_process()
      from making forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds,
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep doing
      almost busy loop, the thread doing print_one_log() can use little CPU
      resource.  Therefore, this situation can be simplified like
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
      
      One of possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() (either oom_kill_process() or
      warn_alloc()) can append to printk() buffer when one series of printk()
      (either oom_kill_process() or warn_alloc()) is already in progress.
      
      Such serialization will also help obtaining kernel messages in readable
      form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock)
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go that direction, for we don't want to
      introduce a new lock dependency, and we unlikely be able to obtain
      useful information even if we serialized oom_kill_process() and
      warn_alloc().
      
      Synchronous approach is prone to unexpected results (e.g.  too late [1],
      too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
      helped with providing information other than "something is going wrong".
      I want to consider asynchronous approach which can obtain information
      during stalls with possibly relevant threads (e.g.  the owner of
      oom_lock and kswapd-like threads) and serve as a trigger for actions
      (e.g.  turn on/off tracepoints, ask libvirt daemon to take a memory dump
      of stalling KVM guest for diagnostic purpose).
      
      This commit temporarily loses ability to report e.g.  OOM lockup due to
      unable to invoke the OOM killer due to !__GFP_FS allocation request.
      But asynchronous approach will be able to detect such situation and emit
      warning.  Thus, let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever"))
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      
      
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Reported-by: default avataryuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      400e2249
    • Johannes Weiner's avatar
      fs: fuse: account fuse_inode slab memory as reclaimable · df206988
      Johannes Weiner authored
      Fuse inodes are currently included in the unreclaimable slab counts -
      SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
      per-cgroup memory.stat.  But they are reclaimable just like other
      filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
      
      Mark the slab cache reclaimable.
      
      Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
      
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df206988
    • Vlastimil Babka's avatar
      mm, page_alloc: fix potential false positive in __zone_watermark_ok · b050e376
      Vlastimil Babka authored
      Since commit 97a16fc8 ("mm, page_alloc: only enforce watermarks for
      order-0 allocations"), __zone_watermark_ok() check for high-order
      allocations will shortcut per-migratetype free list checks for
      ALLOC_HARDER allocations, and return true as long as there's free page
      of any migratetype.  The intention is that ALLOC_HARDER can allocate
      from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
      
      However, as a side effect, the watermark check will then also return
      true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
      CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
      allocation cannot actually obtain isolated pages, and might not be able
      to obtain CMA pages, this can result in a false positive.
      
      The condition should be rare and perhaps the outcome is not a fatal one.
      Still, it's better if the watermark check is correct.  There also
      shouldn't be a performance tradeoff here.
      
      Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
      Fixes: 97a16fc8
      
       ("mm, page_alloc: only enforce watermarks for order-0 allocations")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b050e376
    • Shakeel Butt's avatar
      mm: mlock: remove lru_add_drain_all() · 72b03fcd
      Shakeel Butt authored
      lru_add_drain_all() is not required by mlock() and it will drain
      everything that has been cached at the time mlock is called.  And that
      is not really related to the memory which will be faulted in (and
      cached) and mlocked by the syscall itself.
      
      If anything lru_add_drain_all() should be called _after_ pages have been
      mlocked and faulted in but even that is not strictly needed because
      those pages would get to the appropriate LRUs lazily during the reclaim
      path.  Moreover follow_page_pte (gup) will drain the local pcp LRU
      cache.
      
      On larger machines the overhead of lru_add_drain_all() in mlock() can be
      significant when mlocking data already in memory.  We have observed high
      latency in mlock() due to lru_add_drain_all() when the users were
      mlocking in memory tmpfs files.
      
      [mhocko@suse.com: changelog fix]
      Link: http://lkml.kernel.org/r/20171019222507.2894-1-shakeelb@google.com
      
      
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72b03fcd
    • Kemi Wang's avatar
      mm, sysctl: make NUMA stats configurable · 4518085e
      Kemi Wang authored
      This is the second step which introduces a tunable interface that allow
      numa stats configurable for optimizing zone_statistics(), as suggested
      by Dave Hansen and Ying Huang.
      
      =========================================================================
      
      When page allocation performance becomes a bottleneck and you can
      tolerate some possible tool breakage and decreased numa counter
      precision, you can do:
      
      	echo 0 > /proc/sys/vm/numa_stat
      
      In this case, numa counter update is ignored.  We can see about
      *4.8%*(185->176) drop of cpu cycles per single page allocation and
      reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
      drop of cpu cycles per single page allocation and reclaim on Jesper's
      page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
      (88 threads, 126G memory).
      
      Benchmark link provided by Jesper D Brouer (increase loop times to
      10000000):
      
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      =========================================================================
      
      When page allocation performance is not a bottleneck and you want all
      tooling to work, you can do:
      
      	echo 1 > /proc/sys/vm/numa_stat
      
      This is system default setting.
      
      Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
      for comments to help improve the original patch.
      
      [keescook@chromium.org: make sure mutex is a global static]
        Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
      Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
      
      
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Suggested-by: default avatarYing Huang <ying.huang@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4518085e
    • weiping zhang's avatar
      shmem: convert shmem_init_inodecache() to void · 9a8ec03e
      weiping zhang authored
      shmem_inode_cachep was created with SLAB_PANIC flag and
      shmem_init_inodecache() never returns non-zero, so convert this
      function to return void.
      
      Link: http://lkml.kernel.org/r/20170909124542.GA35224@bogon.didichuxing.com
      
      
      Signed-off-by: default avatarweiping zhang <zhangweiping@didichuxing.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a8ec03e
    • Otto Ebeling's avatar
      Unify migrate_pages and move_pages access checks · 31367466
      Otto Ebeling authored
      Commit 197e7e52 ("Sanitize 'move_pages()' permission checks") fixed
      a security issue I reported in the move_pages syscall, and made it so
      that you can't act on set-uid processes unless you have the
      CAP_SYS_PTRACE capability.
      
      Unify the access check logic of migrate_pages to match the new behavior
      of move_pages.  We discussed this a bit in the security@ list and
      thought it'd be good for consistency even though there's no evident
      security impact.  The NUMA node access checks are left intact and
      require CAP_SYS_NICE as before.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1710011830320.6333@lakka.kapsi.fi
      
      
      Signed-off-by: default avatarOtto Ebeling <otto.ebeling@iki.fi>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31367466
    • Mel Gorman's avatar
      mm, pagevec: rename pagevec drained field · 7f0b5fb9
      Mel Gorman authored
      According to Vlastimil Babka, the drained field in pagevec is
      potentially misleading because it might be interpreted as draining this
      pagevec instead of the percpu lru pagevecs.  Rename the field for
      clarity.
      
      Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Suggested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f0b5fb9
    • Vlastimil Babka's avatar
      mm, page_alloc: simplify list handling in rmqueue_bulk() · 0fac3ba5
      Vlastimil Babka authored
      rmqueue_bulk() fills an empty pcplist with pages from the free list.  It
      tries to preserve increasing order by pfn to the caller, because it
      leads to better performance with some I/O controllers, as explained in
      commit e084b2d9 ("page-allocator: preserve PFN ordering when
      __GFP_COLD is set").
      
      To preserve the order, it's sufficient to add pages to the tail of the
      list as they are retrieved.  The current code instead adds to the head
      of the list, but then updates the list head pointer to the last added
      page, in each step.  This does result in the same order, but is
      needlessly confusing and potentially wasteful, with no apparent benefit.
      This patch simplifies the code and adjusts comment accordingly.
      
      Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0fac3ba5
    • Mel Gorman's avatar
      mm: remove __GFP_COLD · 453f85d4
      Mel Gorman authored
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Juding from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no sigificant difference which is not surprising
      given that the __GFP_COLD branches are a miniscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • Mel Gorman's avatar
      mm: remove cold parameter from free_hot_cold_page* · 2d4894b5
      Mel Gorman authored
      Most callers users of free_hot_cold_page claim the pages being released
      are cache hot.  The exception is the page reclaim paths where it is
      likely that enough pages will be freed in the near future that the
      per-cpu lists are going to be recycled and the cache hotness information
      is lost.  As no one really cares about the hotness of pages being
      released to the allocator, just ditch the parameter.
      
      The APIs are renamed to indicate that it's no longer about hot/cold
      pages.  It should also be less confusing as there are subtle differences
      between them.  __free_pages drops a reference and frees a page when the
      refcount reaches zero.  free_hot_cold_page handled pages whose refcount
      was already zero which is non-obvious from the name.  free_unref_page
      should be more obvious.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      [mgorman@techsingularity.net: add pages to head, not tail]
        Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d4894b5
    • Mel Gorman's avatar
      mm: remove cold parameter for release_pages · c6f92f9f
      Mel Gorman authored
      All callers of release_pages claim the pages being released are cache
      hot.  As no one cares about the hotness of pages being released to the
      allocator, just ditch the parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6f92f9f
    • Mel Gorman's avatar
      mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman authored
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86679820
    • Mel Gorman's avatar
      mm: only drain per-cpu pagevecs once per pagevec usage · d9ed0d08
      Mel Gorman authored
      When a pagevec is initialised on the stack, it is generally used
      multiple times over a range of pages, looking up entries and then
      releasing them.  On each pagevec_release, the per-cpu deferred LRU
      pagevecs are drained on the grounds the page being released may be on
      those queues and the pages may be cache hot.  In many cases only the
      first drain is necessary as it's unlikely that the range of pages being
      walked is racing against LRU addition.  Even if there is such a race,
      the impact is marginal where as constantly redraining the lru pagevecs
      costs.
      
      This patch ensures that pagevec is only drained once in a given
      lifecycle without increasing the cache footprint of the pagevec
      structure.  Only sparsetruncate tiny is shown here as large files have
      many exceptional entries and calls pagecache_release less frequently.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                              batchshadow-v1r1          onedrain-v1r1
      Min          Time      141.00 (   0.00%)      141.00 (   0.00%)
      1st-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      146.00 (   0.00%)      145.00 (   0.68%)
      Max-99%      Time      198.00 (   0.00%)      194.00 (   2.02%)
      Max          Time      254.00 (   0.00%)      208.00 (  18.11%)
      Amean        Time      145.12 (   0.00%)      144.30 (   0.56%)
      Stddev       Time       12.74 (   0.00%)        9.62 (  24.49%)
      Coeff        Time        8.78 (   0.00%)        6.67 (  24.06%)
      Best99%Amean Time      144.29 (   0.00%)      143.82 (   0.32%)
      Best95%Amean Time      142.68 (   0.00%)      142.31 (   0.26%)
      Best90%Amean Time      142.52 (   0.00%)      142.19 (   0.24%)
      Best75%Amean Time      142.26 (   0.00%)      141.98 (   0.20%)
      Best50%Amean Time      141.90 (   0.00%)      141.71 (   0.13%)
      Best25%Amean Time      141.80 (   0.00%)      141.43 (   0.26%)
      
      The impact on bonnie is marginal and within the noise because a
      significant percentage of the file being truncated has been reclaimed
      and consists of shadow entries which reduce the hotness of the
      pagevec_release path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9ed0d08
    • Mel Gorman's avatar
      mm, truncate: remove all exceptional entries from pagevec under one lock · f2187599
      Mel Gorman authored
      During truncate each entry in a pagevec is checked to see if it is an
      exceptional entry and if so, the shadow entry is cleaned up.  This is
      potentially expensive as multiple entries for a mapping locks/unlocks
      the tree lock.  This batches the operation such that any exceptional
      entries removed from a pagevec only acquire the mapping tree lock once.
      The corner case where this is more expensive is where there is only one
      exceptional entry but this is unlikely due to temporal locality and how
      it affects LRU ordering.  Note that for truncations of small files
      created recently, this patch should show no gain because it only batches
      the handling of exceptional entries.
      
      sparsetruncate (large)
                                    4.14.0-rc4             4.14.0-rc4
                               pickhelper-v1r1       batchshadow-v1r1
      Min          Time       38.00 (   0.00%)       27.00 (  28.95%)
      1st-qrtle    Time       40.00 (   0.00%)       28.00 (  30.00%)
      2nd-qrtle    Time       44.00 (   0.00%)       41.00 (   6.82%)
      3rd-qrtle    Time      146.00 (   0.00%)      147.00 (  -0.68%)
      Max-90%      Time      153.00 (   0.00%)      153.00 (   0.00%)
      Max-95%      Time      155.00 (   0.00%)      156.00 (  -0.65%)
      Max-99%      Time      181.00 (   0.00%)      171.00 (   5.52%)
      Amean        Time       93.04 (   0.00%)       88.43 (   4.96%)
      Best99%Amean Time       92.08 (   0.00%)       86.13 (   6.46%)
      Best95%Amean Time       89.19 (   0.00%)       83.13 (   6.80%)
      Best90%Amean Time       85.60 (   0.00%)       79.15 (   7.53%)
      Best75%Amean Time       72.95 (   0.00%)       65.09 (  10.78%)
      Best50%Amean Time       39.86 (   0.00%)       28.20 (  29.25%)
      Best25%Amean Time       39.44 (   0.00%)       27.70 (  29.77%)
      
      bonnie
                                            4.14.0-rc4             4.14.0-rc4
                                       pickhelper-v1r1       batchshadow-v1r1
      Hmean     SeqCreate ops         71.92 (   0.00%)       76.78 (   6.76%)
      Hmean     SeqCreate read        42.42 (   0.00%)       45.01 (   6.10%)
      Hmean     SeqCreate del      26519.88 (   0.00%)    27191.87 (   2.53%)
      Hmean     RandCreate ops        71.92 (   0.00%)       76.95 (   7.00%)
      Hmean     RandCreate read       44.44 (   0.00%)       49.23 (  10.78%)
      Hmean     RandCreate del     24948.62 (   0.00%)    24764.97 (  -0.74%)
      
      Truncation of a large number of files shows a substantial gain with 99%
      of files being truncated 6.46% faster.  bonnie shows a modest gain of
      2.53%
      
      [jack@suse.cz: fix truncate_exceptional_pvec_entries()]
        Link: http://lkml.kernel.org/r/20171108164226.26788-1-jack@suse.cz
      Link: http://lkml.kernel.org/r/20171018075952.10627-4-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f2187599
    • Mel Gorman's avatar
      mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman authored
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7df8ad2
    • Mel Gorman's avatar
      mm, page_alloc: enable/disable IRQs once when freeing a list of pages · 9cca35d4
      Mel Gorman authored
      Patch series "Follow-up for speed up page cache truncation", v2.
      
      This series is a follow-on for Jan Kara's series "Speed up page cache
      truncation" series.  We both ended up looking at the same problem but
      saw different problems based on the same data.  This series builds upon
      his work.
      
      A variety of workloads were compared on four separate machines but each
      machine showed gains albeit at different levels.  Minimally, some of the
      differences are due to NUMA where truncating data from a remote node is
      slower than a local node.  The workloads checked were
      
      o sparse truncate microbenchmark, tiny
      o sparse truncate microbenchmark, large
      o reaim-io disk workfile
      o dbench4 (modified by mmtests to produce more stable results)
      o filebench varmail configuration for small memory size
      o bonnie, directory operations, working set size 2*RAM
      
      reaim-io, dbench and filebench all showed minor gains.  Truncation does
      not dominate those workloads but were tested to ensure no other
      regressions.  They will not be reported further.
      
      The sparse truncate microbench was written by Jan.  It creates a number
      of files and then times how long it takes to truncate each one.  The
      "tiny" configuraiton creates a number of files that easily fits in
      memory and times how long it takes to truncate files with page cache.
      The large configuration uses enough files to have data that is twice the
      size of memory and so timings there include truncating page cache and
      working set shadow entries in the radix tree.
      
      Patches 1-4 are the most relevant parts of this series.  Patches 5-8 are
      optional as they are deleting code that is essentially useless but has a
      negligible performance impact.
      
      The changelogs have more information on performance but just for bonnie
      delete options, the main comparison is
      
      bonnie
                                            4.14.0-rc5             4.14.0-rc5             4.14.0-rc5
                                                jan-v2                vanilla                 mel-v2
      Hmean     SeqCreate ops         76.20 (   0.00%)       75.80 (  -0.53%)       76.80 (   0.79%)
      Hmean     SeqCreate read        85.00 (   0.00%)       85.00 (   0.00%)       85.00 (   0.00%)
      Hmean     SeqCreate del      13752.31 (   0.00%)    12090.23 ( -12.09%)    15304.84 (  11.29%)
      Hmean     RandCreate ops        76.00 (   0.00%)       75.60 (  -0.53%)       77.00 (   1.32%)
      Hmean     RandCreate read       96.80 (   0.00%)       96.80 (   0.00%)       97.00 (   0.21%)
      Hmean     RandCreate del     13233.75 (   0.00%)    11525.35 ( -12.91%)    14446.61 (   9.16%)
      
      Jan's series is the baseline and the vanilla kernel is 12% slower where
      as this series on top gains another 11%.  This is from a different
      machine than the data in the changelogs but the detailed data was not
      collected as there was no substantial change in v2.
      
      This patch (of 8):
      
      Freeing a list of pages current enables/disables IRQs for each page
      freed.  This patch splits freeing a list of pages into two operations --
      preparing the pages for freeing and the actual freeing.  This is a
      tradeoff - we're taking two passes of the list to free in exchange for
      avoiding multiple enable/disable of IRQs.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                 janbatch-v1r1            oneirq-v1r1
      Min          Time      149.00 (   0.00%)      141.00 (   5.37%)
      1st-qrtle    Time      150.00 (   0.00%)      142.00 (   5.33%)
      2nd-qrtle    Time      151.00 (   0.00%)      142.00 (   5.96%)
      3rd-qrtle    Time      151.00 (   0.00%)      143.00 (   5.30%)
      Max-90%      Time      153.00 (   0.00%)      144.00 (   5.88%)
      Max-95%      Time      155.00 (   0.00%)      147.00 (   5.16%)
      Max-99%      Time      201.00 (   0.00%)      195.00 (   2.99%)
      Max          Time      236.00 (   0.00%)      230.00 (   2.54%)
      Amean        Time      152.65 (   0.00%)      144.37 (   5.43%)
      Stddev       Time        9.78 (   0.00%)       10.44 (  -6.72%)
      Coeff        Time        6.41 (   0.00%)        7.23 ( -12.84%)
      Best99%Amean Time      152.07 (   0.00%)      143.72 (   5.50%)
      Best95%Amean Time      150.75 (   0.00%)      142.37 (   5.56%)
      Best90%Amean Time      150.59 (   0.00%)      142.19 (   5.58%)
      Best75%Amean Time      150.36 (   0.00%)      141.92 (   5.61%)
      Best50%Amean Time      150.04 (   0.00%)      141.69 (   5.56%)
      Best25%Amean Time      149.85 (   0.00%)      141.38 (   5.65%)
      
      With a tiny number of files, each file truncated has resident page cache
      and it shows that time to truncate is roughtly 5-6% with some minor
      jitter.
      
                                            4.14.0-rc4             4.14.0-rc4
                                         janbatch-v1r1            oneirq-v1r1
      Hmean     SeqCreate ops         65.27 (   0.00%)       81.86 (  25.43%)
      Hmean     SeqCreate read        39.48 (   0.00%)       47.44 (  20.16%)
      Hmean     SeqCreate del      24963.95 (   0.00%)    26319.99 (   5.43%)
      Hmean     RandCreate ops        65.47 (   0.00%)       82.01 (  25.26%)
      Hmean     RandCreate read       42.04 (   0.00%)       51.75 (  23.09%)
      Hmean     RandCreate del     23377.66 (   0.00%)    23764.79 (   1.66%)
      
      As expected, there is a small gain for the delete operation.
      
      [mgorman@techsingularity.net: use page_private and set_page_private helpers]
        Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cca35d4
    • Jan Kara's avatar
      mm: batch radix tree operations when truncating pages · aa65c29c
      Jan Kara authored
      Currently we remove pages from the radix tree one by one.  To speed up
      page cache truncation, lock several pages at once and free them in one
      go.  This allows us to batch radix tree operations in a more efficient
      way and also save round-trips on mapping->tree_lock.  As a result we
      gain about 20% speed improvement in page cache truncation.
      
      Data from a simple benchmark timing 10000 truncates of 1024 pages (on
      ext4 on ramdisk but the filesystem is barely visible in the profiles).
      The range shows 1% and 95% percentiles of the measured times:
      
        4.14-rc2	4.14-rc2 + batched truncation
        248-256	209-219
        249-258	209-217
        248-255	211-239
        248-255	209-217
        247-256	210-218
      
      [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
        Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
      [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
      Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aa65c29c
    • Jan Kara's avatar
      mm: factor out checks and accounting from __delete_from_page_cache() · 5ecc4d85
      Jan Kara authored
      Move checks and accounting updates from __delete_from_page_cache() into
      a separate function.  We will reuse it when batching page cache
      truncation operations.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ecc4d85
    • Jan Kara's avatar
      mm: move clearing of page->mapping to page_cache_tree_delete() · 2300638b
      Jan Kara authored
      Clearing of page->mapping makes sense in page_cache_tree_delete() as
      well and it will help us with batching things this way.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2300638b
    • Jan Kara's avatar
      mm: move accounting updates before page_cache_tree_delete() · 76253fbc
      Jan Kara authored
      Move updates of various counters before page_cache_tree_delete() call.
      It will be easier to batch things this way and there is no difference
      whether the counters get updated before or after removal from the radix
      tree.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76253fbc
    • Jan Kara's avatar
      mm: factor out page cache page freeing into a separate function · 59c66c5f
      Jan Kara authored
      Factor out page freeing from delete_from_page_cache() into a separate
      function.  We will need to call the same when batching pagecache
      deletion operations.
      
      invalidate_complete_page2() and replace_page_cache_page() might want to
      call this function as well however they currently don't seem to handle
      THPs so it's unnecessary for them to take the hit of checking whether a
      page is THP or not.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59c66c5f
    • Jan Kara's avatar
      mm: refactor truncate_complete_page() · 9f4e41f4
      Jan Kara authored
      Move call of delete_from_page_cache() and page->mapping check out of
      truncate_complete_page() into the single caller - truncate_inode_page().
      Also move page_mapped() check into truncate_complete_page().  That way
      it will be easier to batch operations.
      
      Also rename truncate_complete_page() to truncate_cleanup_page().
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-3-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f4e41f4
    • Jan Kara's avatar
      mm: speed up cancel_dirty_page() for clean pages · 736304f3
      Jan Kara authored
      Patch series "Speed up page cache truncation", v1.
      
      When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12)
      we have noticed a regression in bonnie++ benchmark when deleting files.
      Eventually we have tracked this down to a fact that page cache
      truncation got slower by about 10%.  There were both gains and losses in
      the above interval of kernels but we have been able to identify that
      commit 83929372 ("filemap: prepare find and delete operations for
      huge pages") caused about 10% regression on its own.
      
      After some investigation it didn't seem easily possible to fix the
      regression while maintaining the THP in page cache functionality so
      we've decided to optimize the page cache truncation path instead to make
      up for the change.  This series is a result of that effort.
      
      Patch 1 is an easy speedup of cancel_dirty_page().  Patches 2-6 refactor
      page cache truncation code so that it is easier to batch radix tree
      operations.  Patch 7 implements batching of deletes from the radix tree
      which more than makes up for the original regression.
      
      This patch (of 7):
      
      cancel_dirty_page() does quite some work even for clean pages (fetching
      of mapping, locking of memcg, atomic bit op on page flags) so it
      accounts for ~2.5% of cost of truncation of a clean page.  That is not
      much but still dumb for something we don't need at all.  Check whether a
      page is actually dirty and avoid any work if not.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      736304f3
    • Colin Ian King's avatar
      drivers/block/zram/zram_drv.c: make zram_page_end_io() static · 384bc41f
      Colin Ian King authored
      zram_page_end_io() is local to the source and does not need to be in
      global scope, so make it static.
      
      Cleans up sparse warning:
      
        symbol 'zram_page_end_io' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20171016173336.20320-1-colin.king@canonical.com
      
      
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      384bc41f
    • Kees Cook's avatar
      mm/page-writeback.c: convert timers to use timer_setup() · 9823e51b
      Kees Cook authored
      In preparation for unconditionally passing the struct timer_list pointer
      to all timer callbacks, switch to using the new timer_setup() and
      from_timer() to pass the timer pointer explicitly.
      
      Link: http://lkml.kernel.org/r/20171016225913.GA99214@beast
      
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9823e51b
    • Laszlo Toth's avatar
      mm, soft_offline: improve hugepage soft offlining error log · b6b18aa8
      Laszlo Toth authored
      On a failed attempt, we get the following entry: soft offline: 0x3c0000:
      migration failed 1, type 17ffffc0008008 (uptodate|head)
      
      Make this more specific to be straightforward and to follow other error
      log formats in soft_offline_huge_page().
      
      Link: http://lkml.kernel.org/r/20171016171757.GA3018@ubuntu-desk-vm
      
      
      Signed-off-by: default avatarLaszlo Toth <laszlth@gmail.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6b18aa8
    • Mike Rapoport's avatar
      userfaultfd: use mmgrab instead of open-coded increment of mm_count · 00bb31fa
      Mike Rapoport authored
      Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com
      
      
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00bb31fa
    • Aaron Lu's avatar
      mm/page_alloc: make sure __rmqueue() etc are always inline · 85ccc8fa
      Aaron Lu authored
      __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
      __rmqueue_cma_fallback() are all in page allocator's hot path and better
      be finished as soon as possible.  One way to make them faster is by making
      them inline.  But as Andrew Morton and Andi Kleen pointed out:
      
        https://lkml.org/lkml/2017/10/10/1252
        https://lkml.org/lkml/2017/10/10/1279
      
      To make sure they are inlined, we should use __always_inline for them.
      
      With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
      processes to stress buddy, the results for will-it-scale.processes with
      and without the patch are:
      
      On a 2-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7       6496131     6911823 +6.4%
        gcc-4.9.4       7225110     7731072 +7.0%
        gcc-5.4.1       7054224     7688146 +9.0%
        gcc-6.2.0       7059794     7651675 +8.4%
      
      On a 4-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7      13162890    13508193 +2.6%
        gcc-4.9.4      14997463    15484353 +3.2%
        gcc-5.4.1      14708711    15449805 +5.0%
        gcc-6.2.0      14574099    15349204 +5.3%
      
      The above 4 compilers are used because I've done the tests through
      Intel's Linux Kernel Performance(LKP) infrastructure and they are the
      available compilers there.
      
      The benefit being less on 4 sockets machine is due to the lock
      contention there(perf-profile/native_queued_spin_lock_slowpath=81%) is
      less severe than on the 2 sockets machine(85%).
      
      What the benchmark does is: it forks nr_cpu processes and then each
      process does the following:
          1 mmap() 128M anonymous space;
          2 writes to each page there to trigger actual page allocation;
          3 munmap() it.
      in a loop.
      
        https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Binary size wise, I have locally built them with different compilers:
      
      [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
         text    data     bss     dec     hex filename
        37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
        38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
        37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
        38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
        37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
        38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
        36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
        37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
      
      Text size increased about 800 bytes for mm/page_alloc.o.
      
      [aaron@aaronlu obj]$ size */*/vmlinux
         text    data     bss     dec       hex     filename
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
      
      Text size for vmlinux has no change though, probably due to function
      alignment.
      
      Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
      
      
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85ccc8fa
    • Pavel Tatashin's avatar
      sparc64: optimize struct page zeroing · 78c94366
      Pavel Tatashin authored
      Add an optimized mm_zero_struct_page(), so struct page's are zeroed
      without calling memset().  We do eight to ten regular stores based on
      the size of struct page.  Compiler optimizes out the conditions of
      switch() statement.
      
      SPARC-M6 with 15T of memory, single thread performance:
      
                                     BASE            FIX  OPTIMIZED_FIX
              bootmem_init   28.440467985s   2.305674818s   2.305161615s
      free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
                            --------------------------------------------
      Total                 231.286369658s 227.648759326s 174.861668175s
      
      BASE:  current linux
      FIX:   This patch series without "optimized struct page zeroing"
      OPTIMIZED_FIX: This patch series including the current patch.
      
      bootmem_init() is where memory for struct pages is zeroed during
      allocation.  Note, about two seconds in this function is a fixed time:
      it does not increase as memory is increased.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-11-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78c94366
    • Pavel Tatashin's avatar
      mm: stop zeroing memory during allocation in vmemmap · f7f99100
      Pavel Tatashin authored
      vmemmap_alloc_block() will no longer zero the block, so zero memory at
      its call sites for everything except struct pages.  Struct page memory
      is zero'd by struct page initialization.
      
      Replace allocators in sparse-vmemmap to use the non-zeroing version.
      So, we will get the performance improvement by zeroing the memory in
      parallel when struct pages are zeroed.
      
      Add struct page zeroing as a part of initialization of other fields in
      __init_single_page().
      
      This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
      v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
      
                               BASE            FIX
      sparse_init     11.244671836s   0.007199623s
      zone_sizes_init  4.879775891s   8.355182299s
                        --------------------------
      Total           16.124447727s   8.362381922s
      
      sparse_init is where memory for struct pages is zeroed, and the zeroing
      part is moved later in this patch into __init_single_page(), which is
      called from zone_sizes_init().
      
      [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
      Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7f99100
    • Will Deacon's avatar
      arm64/mm/kasan: don't use vmemmap_populate() to initialize shadow · e17d8025
      Will Deacon authored
      The kasan shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      kasan, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-3-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e17d8025
    • Andrey Ryabinin's avatar
      x86/mm/kasan: don't use vmemmap_populate() to initialize shadow · d17a1d97
      Andrey Ryabinin authored
      The kasan shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      kasan, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d17a1d97
    • Pavel Tatashin's avatar
      mm: zero reserved and unavailable struct pages · a4a3ede2
      Pavel Tatashin authored
      Some memory is reserved but unavailable: not present in memblock.memory
      (because not backed by physical pages), but present in memblock.reserved.
      Such memory has backing struct pages, but they are not initialized by
      going through __init_single_page().
      
      In some cases these struct pages are accessed even if they do not
      contain any data.  One example is page_to_pfn() might access page->flags
      if this is where section information is stored (CONFIG_SPARSEMEM,
      SECTION_IN_PAGE_FLAGS).
      
      One example of such memory: trim_low_memory_range() unconditionally
      reserves from pfn 0, but e820__memblock_setup() might provide the
      exiting memory from pfn 1 (i.e.  KVM).
      
      Since struct pages are zeroed in __init_single_page(), and not during
      allocation time, we must zero such struct pages explicitly.
      
      The patch involves adding a new memblock iterator:
      	for_each_resv_unavail_range(i, p_start, p_end)
      
      Which iterates through reserved && !memory lists, and we zero struct pages
      explicitly by calling mm_zero_struct_page().
      
      ===
      
      Here is more detailed example of problem that this patch is addressing:
      
      Run tested on qemu with the following arguments:
      
      	-enable-kvm -cpu kvm64 -m 512 -smp 2
      
      This patch reports that there are 98 unavailable pages.
      
      They are: pfn 0 and pfns in range [159, 255].
      
      Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
      not reserve [159, 255] ones.
      
      e820__memblock_setup() reports linux that the following physical ranges are
      available:
          [1 , 158]
      [256, 130783]
      
      Notice, that exactly unavailable pfns are missing!
      
      Now, lets check what we have in zone 0: [1, 131039]
      
      pfn 0, is not part of the zone, but pfns [1, 158], are.
      
      However, the bigger problem we have if we do not initialize these struct
      pages is with memory hotplug.  Because, that path operates at 2M
      boundaries (section_nr).  And checks if 2M range of pages is hot
      removable.  It starts with first pfn from zone, rounds it down to 2M
      boundary (sturct pages are allocated at 2M boundaries when vmemmap is
      created), and checks if that section is hot removable.  In this case
      start with pfn 1 and convert it down to pfn 0.  Later pfn is converted
      to struct page, and some fields are checked.  Now, if we do not zero
      struct pages, we get unpredictable results.
      
      In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
      vmemmap memory to ones, the following panic is observed with kernel test
      without this patch applied:
      
        BUG: unable to handle kernel NULL pointer dereference at          (null)
        IP: is_pageblock_removable_nolock+0x35/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT
        ...
        task: ffff88001f4e2900 task.stack: ffffc90000314000
        RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
        Call Trace:
         ? is_mem_section_removable+0x5a/0xd0
         show_mem_removable+0x6b/0xa0
         dev_attr_show+0x1b/0x50
         sysfs_kf_seq_show+0xa1/0x100
         kernfs_seq_show+0x22/0x30
         seq_read+0x1ac/0x3a0
         kernfs_fop_read+0x36/0x190
         ? security_file_permission+0x90/0xb0
         __vfs_read+0x16/0x30
         vfs_read+0x81/0x130
         SyS_read+0x44/0xa0
         entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4a3ede2
    • Pavel Tatashin's avatar
      mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Pavel Tatashin authored
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
      Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
      specify that memory that was allocated for system hash needs to be
      zeroed, otherwise the memory does not need to be zeroed, and client will
      initialize it.
      
      If memory does not need to be zero'd, call the new
      memblock_virt_alloc_raw() interface, and thus improve the boot
      performance.
      
      * debug for raw alloctor
      
      When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
      returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
      places excpect zeroed memory.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
    • Pavel Tatashin's avatar
      sparc64: simplify vmemmap_populate · df8ee578
      Pavel Tatashin authored
      Remove duplicating code by using common functions vmemmap_pud_populate
      and vmemmap_pgd_populate.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-5-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df8ee578
    • Pavel Tatashin's avatar
      sparc64/mm: set fields in deferred pages · 2a20aa17
      Pavel Tatashin authored
      Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
      flags and other fields in "struct page"es are never changed prior to
      first initializing struct pages by going through __init_single_page().
      
      With deferred struct page feature enabled there is a case where we set
      some fields prior to initializing:
      
      mem_init() {
           register_page_bootmem_info();
           free_all_bootmem();
           ...
      }
      
      When register_page_bootmem_info() is called only non-deferred struct
      pages are initialized.  But, this function goes through some reserved
      pages which might be part of the deferred, and thus are not yet
      initialized.
      
      mem_init
      register_page_bootmem_info
      register_page_bootmem_info_node
       get_page_bootmem
        .. setting fields here ..
        such as: page->freelist = (void *)type;
      
      free_all_bootmem()
      free_low_memory_core_early()
       for_each_reserved_mem_region()
        reserve_bootmem_region()
         init_reserved_page() <- Only if this is deferred reserved page
          __init_single_pfn()
           __init_single_page()
            memset(0) <-- Loose the set fields here
      
      We end up with similar issue as in the previous patch, where currently
      we do not observe problem as memory is zeroed.  But, if flag asserts are
      changed we can start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a20aa17
    • Pavel Tatashin's avatar
      x86/mm: set fields in deferred pages · 353b1e7b
      Pavel Tatashin authored
      Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
      flags and other fields in "struct page"es are never changed prior to
      first initializing struct pages by going through __init_single_page().
      
      With deferred struct page feature enabled, however, we set fields in
      register_page_bootmem_info that are subsequently clobbered right after
      in free_all_bootmem:
      
              mem_init() {
                      register_page_bootmem_info();
                      free_all_bootmem();
                      ...
              }
      
      When register_page_bootmem_info() is called only non-deferred struct
      pages are initialized.  But, this function goes through some reserved
      pages which might be part of the deferred, and thus are not yet
      initialized.
      
        mem_init
         register_page_bootmem_info
          register_page_bootmem_info_node
           get_page_bootmem
            .. setting fields here ..
            such as: page->freelist = (void *)type;
      
        free_all_bootmem()
         free_low_memory_core_early()
          for_each_reserved_mem_region()
           reserve_bootmem_region()
            init_reserved_page() <- Only if this is deferred reserved page
             __init_single_pfn()
              __init_single_page()
                  memset(0) <-- Loose the set fields here
      
      We end up with issue where, currently we do not observe problem as
      memory is explicitly zeroed.  But, if flag asserts are changed we can
      start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      353b1e7b
    • Pavel Tatashin's avatar
      mm: deferred_init_memmap improvements · 2f47a91f
      Pavel Tatashin authored
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read access struct page until it was initialized
       - Never set any fields in struct pages before they are initialized
       - Zero struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
      This way we are checking if the current deferred page has already been
      initialized.  It works, because memory for struct pages has been zeroed,
      and the only way flags are not zero if it went through
      __init_single_page() before.  But, once we change the current behavior
      and won't zero the memory in memblock allocator, we cannot trust
      anything inside "struct page"es until they are initialized.  This patch
      fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
      
      Note, this first issue is relevant only when the following change is
      merged:
      
      =====
      This patch fixes another existing issue on systems that have holes in
      zones i.e CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
        if (!pfn_valid_within(pfn)
      	goto free_range;
      
      Note: 'page' is not set to NULL and is not incremented but 'pfn'
      advances.  Thus means if deferred struct pages are enabled on systems
      with these kind of holes, linux would get memory corruptions.  I have
      fixed this issue by defining a new macro that performs all the necessary
      operations when we free the current set of pages.
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
    • Changbin Du's avatar
      mm/swap_state.c: declare a few variables as __read_mostly · 783cb68e
      Changbin Du authored
      These global variables are only set during initialization or rarely
      change, so declare them as __read_mostly.
      
      Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
      
      
      Signed-off-by: default avatarChangbin Du <changbin.du@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      783cb68e