Skip to content
  1. Apr 08, 2014
    • Johannes Weiner's avatar
      mm: page_alloc: spill to remote nodes before waking kswapd · 3a025760
      Johannes Weiner authored
      On NUMA systems, a node may start thrashing cache or even swap anonymous
      pages while there are still free pages on remote nodes.
      
      This is a result of commits 81c0a2bb ("mm: page_alloc: fair zone
      allocator policy") and fff4068c
      
       ("mm: page_alloc: revert NUMA aspect
      of fair allocation policy").
      
      Before those changes, the allocator would first try all allowed zones,
      including those on remote nodes, before waking any kswapds.  But now,
      the allocator fastpath doubles as the fairness pass, which in turn can
      only consider the local node to prevent remote spilling based on
      exhausted fairness batches alone.  Remote nodes are only considered in
      the slowpath, after the kswapds are woken up.  But if remote nodes still
      have free memory, kswapd should not be woken to rebalance the local node
      or it may thrash cash or swap prematurely.
      
      Fix this by adding one more unfair pass over the zonelist that is
      allowed to spill to remote nodes after the local fairness pass fails but
      before entering the slowpath and waking the kswapds.
      
      This also gets rid of the GFP_THISNODE exemption from the fairness
      protocol because the unfair pass is no longer tied to kswapd, which
      GFP_THISNODE is not allowed to wake up.
      
      However, because remote spills can be more frequent now - we prefer them
      over local kswapd reclaim - the allocation batches on remote nodes could
      underflow more heavily.  When resetting the batches, use
      atomic_long_read() directly instead of zone_page_state() to calculate the
      delta as the latter filters negative counter values.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a025760
    • Michal Hocko's avatar
      memcg: rename high level charging functions · d715ae08
      Michal Hocko authored
      
      
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08
    • Johannes Weiner's avatar
      memcg: sanitize __mem_cgroup_try_charge() call protocol · 6d1fdc48
      Johannes Weiner authored
      
      
      Some callsites pass a memcg directly, some callsites pass an mm that
      then has to be translated to a memcg.  This makes for a terrible
      function interface.
      
      Just push the mm-to-memcg translation into the respective callsites and
      always pass a memcg to mem_cgroup_try_charge().
      
      [mhocko@suse.cz: add charge mm helper]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d1fdc48
    • Michal Hocko's avatar
      memcg: do not replicate get_mem_cgroup_from_mm in __mem_cgroup_try_charge · b6b6cc72
      Michal Hocko authored
      
      
      __mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
      which came without a memcg.  The only reason seems to be a tiny
      optimization when css_tryget is not called if the charge can be consumed
      from the stock.  Nevertheless css_tryget is very cheap since it has been
      reworked to use per-cpu counting so this optimization doesn't give us
      anything these days.
      
      So let's drop the code duplication so that the code is more readable.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6b6cc72
    • Johannes Weiner's avatar
      memcg: get_mem_cgroup_from_mm() · df381975
      Johannes Weiner authored
      
      
      Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
      owner is exiting, just return root_mem_cgroup.  This makes sense for all
      callsites and gets rid of some of them having to fallback manually.
      
      [fengguang.wu@intel.com: fix warnings]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df381975
    • Johannes Weiner's avatar
      memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm() · 03583f1a
      Johannes Weiner authored
      
      
      Users pass either a mm that has been established under task lock, or use
      a verified current->mm, which means the task can't be exiting.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03583f1a
    • Johannes Weiner's avatar
      mm: memcg: push !mm handling out to page cache charge function · 284f39af
      Johannes Weiner authored
      
      
      Only page cache charges can happen without an mm context, so push this
      special case out of the inner core and into the cache charge function.
      
      An ancient comment explains that the mm can also be NULL in case the
      task is currently being migrated, but that is not actually true with the
      current case, so just remove it.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      284f39af
    • Johannes Weiner's avatar
      mm: memcg: inline mem_cgroup_charge_common() · 1bec6b33
      Johannes Weiner authored
      
      
      mem_cgroup_charge_common() is used by both cache and anon pages, but
      most of its body only applies to anon pages and the remainder is not
      worth having in a separate function.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1bec6b33
    • Johannes Weiner's avatar
      mm: memcg: remove mem_cgroup_move_account_page_stat() · 59d1d256
      Johannes Weiner authored
      
      
      It used to disable preemption and run sanity checks but now it's only
      taking a number out of one percpu counter and putting it into another.
      Do this directly in the callsite and save the indirection.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59d1d256
    • Johannes Weiner's avatar
      mm: memcg: remove unnecessary preemption disabling · 7af467e8
      Johannes Weiner authored
      
      
      lock_page_cgroup() disables preemption, remove explicit preemption
      disabling for code paths holding this lock.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7af467e8
    • Kirill A. Shutemov's avatar
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov authored
      
      
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • Gioh Kim's avatar
      mm/vmalloc.c: enhance vm_map_ram() comment · 36437638
      Gioh Kim authored
      
      
      vm_map_ram() has a fragmentation problem when it cannot purge a
      chunk(ie, 4M address space) if there is a pinning object in that
      addresss space.  So it could consume all VMALLOC address space easily.
      
      We can fix the fragmentation problem by using vmap instead of
      vm_map_ram() but vmap() is known to be slow compared to vm_map_ram().
      Minchan said vm_map_ram is 5 times faster than vmap in his tests.  So I
      thought we should fix fragment problem of vm_map_ram because our
      proprietary GPU driver has used it heavily.
      
      On second thought, it's not an easy because we should reuse freed space
      for solving the problem and it could make more IPI and bitmap operation
      for searching hole.  It could mitigate API's goal which is very fast
      mapping.  And even fragmentation problem wouldn't show in 64 bit
      machine.
      
      Another option is that the user should separate long-life and short-life
      object and use vmap for long-life but vm_map_ram for short-life.  If we
      inform the user about the characteristic of vm_map_ram the user can
      choose one according to the page lifetime.
      
      Let's add some notice messages to user.
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarGioh Kim <gioh.kim@lge.com>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36437638
    • Choi Gi-yong's avatar
    • Mikulas Patocka's avatar
      mempool: add unlikely and likely hints · eb9a3c62
      Mikulas Patocka authored
      
      
      Add unlikely and likely hints to the function mempool_free.  It lays out
      the code in such a way that the common path is executed straighforward and
      saves a cache line.
      
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eb9a3c62
    • David Rientjes's avatar
      mm, compaction: determine isolation mode only once · da1c67a7
      David Rientjes authored
      
      
      The conditions that control the isolation mode in
      isolate_migratepages_range() do not change during the iteration, so
      extract them out and only define the value once.
      
      This actually does have an effect, gcc doesn't optimize it itself because
      of cc->sync.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da1c67a7
    • David Rientjes's avatar
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes authored
      
      
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • David Rientjes's avatar
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes authored
      
      
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • David Rientjes's avatar
      mm, mempolicy: rename slab_node for clarity · 2a389610
      David Rientjes authored
      
      
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it used for processes with
      mempolicies.
      
      At the same time, cleanup its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumes the mempolicy is actually passed
      into the function.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a389610
    • David Rientjes's avatar
      fork: collapse copy_flags into copy_process · 514ddb44
      David Rientjes authored
      
      
      copy_flags() does not use the clone_flags formal and can be collapsed
      into copy_process() for cleaner code.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      514ddb44
    • Gideon Israel Dsouza's avatar
      mm: use macros from compiler.h instead of __attribute__((...)) · 3b32123d
      Gideon Israel Dsouza authored
      
      
      To increase compiler portability there is <linux/compiler.h> which
      provides convenience macros for various gcc constructs.  Eg: __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes with
      the right macro in the memory management (/mm) subsystem.
      
      [akpm@linux-foundation.org: while-we're-there consistency tweaks]
      Signed-off-by: default avatarGideon Israel Dsouza <gidisrael@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b32123d
    • Davidlohr Bueso's avatar
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso authored
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410
      
      ), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • Ning Qu's avatar
      mm: implement ->map_pages for shmem/tmpfs · d7c17551
      Ning Qu authored
      
      
      In shmem/tmpfs, we also use the generic filemap_map_pages, seems the
      additional checking is not worth a separate version of map_pages for it.
      
      Signed-off-by: default avatarNing Qu <quning@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7c17551
    • Kirill A. Shutemov's avatar
      mm: add debugfs tunable for fault_around_order · 1592eef0
      Kirill A. Shutemov authored
      
      
      Let's allow people to tweak faultaround at runtime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1592eef0
    • Kirill A. Shutemov's avatar
      mm: cleanup size checks in filemap_fault() and filemap_map_pages() · 99e3e53f
      Kirill A. Shutemov authored
      
      
      Minor cleanups:
       - 'size' variable is now in bytes, not pages;
       - use round_up(): it should be easier to read.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99e3e53f
    • Kirill A. Shutemov's avatar
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov authored
      
      
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • Kirill A. Shutemov's avatar
      mm: introduce vm_ops->map_pages() · 8c6e50b0
      Kirill A. Shutemov authored
      
      
      Here's new version of faultaround patchset.  It took a while to tune it
      and collect performance data.
      
      First patch adds new callback ->map_pages to vm_operations_struct.
      
      ->map_pages() is called when VM asks to map easy accessible pages.
      Filesystem should find and map pages associated with offsets from
      "pgoff" till "max_pgoff".  ->map_pages() is called with page table
      locked and must not block.  If it's not possible to reach a page without
      blocking, filesystem should skip it.  Filesystem should use do_set_pte()
      to setup page table entry.  Pointer to entry associated with offset
      "pgoff" is passed in "pte" field in vm_fault structure.  Pointers to
      entries for other offsets should be calculated relative to "pte".
      
      Currently VM use ->map_pages only on read page fault path.  We try to
      map FAULT_AROUND_PAGES a time.  FAULT_AROUND_PAGES is 16 for now.
      Performance data for different FAULT_AROUND_ORDER is below.
      
      TODO:
       - implement ->map_pages() for shmem/tmpfs;
       - modify get_user_pages() to be able to use ->map_pages() and implement
         mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
      
      =========================================================================
      Tested on 4-socket machine (120 threads) with 128GiB of RAM.
      
      Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
      somewhere between 3 and 5. Let's say 4 :)
      
      Linux build (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
      	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
      Linux rebuild (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
      	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
      Git test suite (make -j60 test)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
      	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413
      
      Two synthetic tests: access every word in file in sequential/random order.
      It doesn't improve much after FAULT_AROUND_ORDER == 4.
      
      Sequential access 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
      	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
       8 threads
      	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
      	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
       32 threads
      	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
      	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
       60 threads
      	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
      	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
       120 threads
      	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
      	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
      Random access 1GiB file
       1 thread
      	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
      	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
       8 threads
      	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
      	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
       32 threads
      	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
      	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
       60 threads
      	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
      	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
       120 threads
      	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
      	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594
      
      Touch only one page in page table in 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
      	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
       8 threads
      	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
      	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
       32 threads
      	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
      	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
       60 threads
      	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
      	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
       120 threads
      	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
      	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125
      
      This patch (of 2):
      
      Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
      accessible pages around fault address.
      
      On read page fault, if filesystem provides ->map_pages(), we try to map up
      to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
      number of minor page faults.
      
      We call ->map_pages first and use ->fault() as fallback if page by the
      offset is not ready to be mapped (cold page cache or something).
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c6e50b0
    • Andrew Morton's avatar
      drivers/lguest/page_tables.c: rename do_set_pte() · 179e0963
      Andrew Morton authored
      
      
      "mm: introduce vm_ops->map_pages()" wants to export a do_set_pte() from core
      kernel.  Rename lguest's do_set_pte() to something more lguest-specific.
      
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      179e0963
    • Konstantin Khlebnikov's avatar
      tools/vm/page-types.c: page-cache sniffing feature · 65a6a410
      Konstantin Khlebnikov authored
      
      
      After this patch 'page-types' can walk over a file's mappings and
      analyze populated page cache pages mostly without disturbing its state.
      
      It maps chunk of file, marks VMA as MADV_RANDOM to turn off readahead,
      pokes VMA via mincore() to determine cached pages, triggers page-fault
      only for them, and finally gathers information via pagemap/kpageflags.
      Before unmap it marks VMA as MADV_SEQUENTIAL for ignoring reference
      bits.
      
      usage: page-types -f <path>
      
      If <path> is directory it will analyse all files in all subdirectories.
      
      Symlinks are not followed as well as mount points.  Hardlinks aren't
      handled, they'll be dumped as many times as they are found.  Recursive
      walk brings all dentries into dcache and populates page cache of
      block-devices aka 'Buffers'.
      
      Probably it's worth to add ioctl for dumping file page cache as array of
      PFNs as a replacement for this hackish juggling with
      mmap/madvise/mincore/pagemap.  Also recursive walk could be replaced
      with dumping cached inodes via some ioctl or debugfs interface followed
      by openning them via open_by_handle_at, this would fix hardlinks
      handling and unneeded population of dcache and buffers.  This interface
      might be used as data source for constructing readahead plans and for
      background optimizations of actively used files.
      
      collateral changes:
      + fix 64-bit LFS: define _FILE_OFFSET_BITS instead of _LARGEFILE64_SOURCE
      + replace lseek + read with single pread
      + make show_page_range() reusable after flush
      
      usage example:
      
        ~/src/linux/tools/vm$ sudo ./page-types -L -f page-types
        foffset offset    flags
        page-types       Inode: 2229277       Size: 89065 (22 pages)
        Modify: Tue Feb 25 12:00:59 2014 (162 seconds ago)
        Access: Tue Feb 25 12:01:00 2014 (161 seconds ago)
        0       3cbf3b     __RU_lA____M________________________
        1       38946a     __RU_lA____M________________________
        2       1a3cec     __RU_lA____M________________________
        3       1a8321     __RU_lA____M________________________
        4       3af7cc     __RU_lA____M________________________
        5       1ed532     __RU_lA_____________________________
        6       2e436a     __RU_lA_____________________________
        7       29a35e     ___U_lA_____________________________
        8       2de86e     ___U_lA_____________________________
        9       3bdfb4     ___U_lA_____________________________
        10      3cd8a3     ___U_lA_____________________________
        11      2afa50     ___U_lA_____________________________
        12      2534c2     ___U_lA_____________________________
        13      1b7a40     ___U_lA_____________________________
        14      17b0be     ___U_lA_____________________________
        15      392b0c     ___U_lA_____________________________
        16      3ba46a     __RU_lA_____________________________
        17      397dc8     ___U_lA_____________________________
        18      1f2a36     ___U_lA_____________________________
        19      21fd30     __RU_lA_____________________________
        20      2c35ba     __RU_l______________________________
        21      20f181     __RU_l______________________________
      
                     flags page-count   MB  symbolic-flags                        long-symbolic-flags
        0x000000000000002c          2    0  __RU_l______________________________  referenced,uptodate,lru
        0x0000000000000068         11    0  ___U_lA_____________________________  uptodate,lru,active
        0x000000000000006c          4    0  __RU_lA_____________________________  referenced,uptodate,lru,active
        0x000000000000086c          5    0  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
                     total         22    0
      
        ~/src/linux/tools/vm$ sudo ./page-types -f /
                     flags page-count     MB  symbolic-flags                        long-symbolic-flags
        0x0000000000000028      21761     85  ___U_l______________________________  uptodate,lru
        0x000000000000002c     127279    497  __RU_l______________________________  referenced,uptodate,lru
        0x0000000000000068      74160    289  ___U_lA_____________________________  uptodate,lru,active
        0x000000000000006c      84469    329  __RU_lA_____________________________  referenced,uptodate,lru,active
        0x000000000000007c          1      0  __RUDlA_____________________________  referenced,uptodate,dirty,lru,active
        0x0000000000000228        370      1  ___U_l___I__________________________  uptodate,lru,reclaim
        0x0000000000000828         49      0  ___U_l_____M________________________  uptodate,lru,mmap
        0x000000000000082c        126      0  __RU_l_____M________________________  referenced,uptodate,lru,mmap
        0x0000000000000868        137      0  ___U_lA____M________________________  uptodate,lru,active,mmap
        0x000000000000086c      12890     50  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
                     total     321242   1254
      
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      65a6a410
    • Kirill A. Shutemov's avatar
      mm: disable split page table lock for !MMU · 9164550e
      Kirill A. Shutemov authored
      
      
      There's no reason to enable split page table lock if don't have page
      tables.
      
      It also triggers build error at least on ARM since we don't define
      pmd_page() for !MMU.
      
        In file included from arch/arm/kernel/asm-offsets.c:14:0:
        include/linux/mm.h: In function 'pte_lockptr':
        include/linux/mm.h:1392:2: error: implicit declaration of function 'pmd_page' [-Werror=implicit-function-declaration]
        include/linux/mm.h:1392:2: warning: passing argument 1 of 'ptlock_ptr' makes pointer from integer without a cast [enabled by default]
        include/linux/mm.h:1384:27: note: expected 'struct page *' but argument is of type 'int'
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9164550e
    • Alex Thorlton's avatar
      exec: kill the unnecessary mm->def_flags setting in load_elf_binary() · ab0e113f
      Alex Thorlton authored
      
      
      load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
      is always zero.  Not only this looks strange, this is unnecessary
      because mm_init() has already set ->def_flags = 0.
      
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab0e113f
    • Alex Thorlton's avatar
      mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE · a0715cc2
      Alex Thorlton authored
      
      
      Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs.  It
      also adds a prctl control which allows us to set the THP disable bit in
      mm->def_flags so that VMs will pick up the setting as they are created.
      
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a0715cc2
    • Alex Thorlton's avatar
      mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" · 1e1836e8
      Alex Thorlton authored
      The main motivation behind this patch is to provide a way to disable THP
      for jobs where the code cannot be modified, and using a malloc hook with
      madvise is not an option (i.e.  statically allocated data).  This patch
      allows us to do just that, without affecting other jobs running on the
      system.
      
      We need to do this sort of thing for jobs where THP hurts performance,
      due to the possibility of increased remote memory accesses that can be
      created by situations such as the following:
      
      When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
      be handed out, and the THP will be stuck on whatever node the chunk was
      originally referenced from.  If many remote nodes need to do work on
      that same chunk, they'll be making remote accesses.
      
      With THP disabled, 4K pages can be handed out to separate nodes as
      they're needed, greatly reducing the amount of remote accesses to
      memory.
      
      This patch is based on some of my work combined with some
      suggestions/patches given by Oleg Nesterov.  The main goal here is to
      add a prctl switch to allow us to disable to THP on a per mm_struct
      basis.
      
      Here's a bit of test data with the new patch in place...
      
      First with the flag unset:
      
        # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 0
        Set thp_disabled state to 0
        Process pid = 18027
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
                 1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
                     7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
                 1,698,932 page-faults               #    0.000 M/sec
        355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
        536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
        409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
        148,286,797,266,411 instructions             #    0.42  insns per cycle
                                                     #    3.62  stalled cycles per insn [100.00%]
        27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
             1,188,655,196 branch-misses             #    0.00% of all branches
      
             427.001706337 seconds time elapsed
      
      Now with the flag set:
      
        # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 1
        Set thp_disabled state to 1
        Process pid = 144957
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
                   534,205 context-switches          #    0.000 M/sec                   [100.00%]
                     4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
                63,133,119 page-faults               #    0.000 M/sec
        147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
        200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
        105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
        180,916,213,503,160 instructions             #    1.22  insns per cycle
                                                     #    1.11  stalled cycles per insn [100.00%]
        26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
               714,066,351 branch-misses             #    0.00% of all branches
      
             216.196778807 seconds time elapsed
      
      As with previous versions of the patch, We're getting about a 2x
      performance increase here.  Here's a link to the test case I used, along
      with the little wrapper to activate the flag:
      
        http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
      
      This patch (of 3):
      
      Revert commit 8e72033f and add in code to fix up any issues caused
      by the revert.
      
      The revert is necessary because hugepage_madvise would return -EINVAL
      when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
      patch set.
      
      Here's a snip of an e-mail from Gerald detailing the original purpose of
      this code, and providing justification for the revert:
      
        "The intent of commit 8e72033f was to guard against any future
         programming errors that may result in an madvice(MADV_HUGEPAGE) on
         guest mappings, which would crash the kernel.
      
         Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
         8e72033f
      
       was to be reverted, because that check will also prevent
         a kernel crash in the case described above, it will now send a
         SIGSEGV instead.
      
         This would now also allow to do the madvise on other parts, if
         needed, so it is a more flexible approach.  One could also say that
         it would have been better to do it this way right from the
         beginning..."
      
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Tested-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e1836e8
    • Joonsoo Kim's avatar
      mm/compaction: clean-up code on success of ballon isolation · b6c75016
      Joonsoo Kim authored
      
      
      It is just for clean-up to reduce code size and improve readability.
      There is no functional change.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6c75016
    • Joonsoo Kim's avatar
      mm/compaction: check pageblock suitability once per pageblock · c122b208
      Joonsoo Kim authored
      
      
      isolation_suitable() and migrate_async_suitable() is used to be sure
      that this pageblock range is fine to be migragted.  It isn't needed to
      call it on every page.  Current code do well if not suitable, but, don't
      do well when suitable.
      
      1) It re-checks isolation_suitable() on each page of a pageblock that was
         already estabilished as suitable.
      2) It re-checks migrate_async_suitable() on each page of a pageblock that
         was not entered through the next_pageblock: label, because
         last_pageblock_nr is not otherwise updated.
      
      This patch fixes situation by 1) calling isolation_suitable() only once
      per pageblock and 2) always updating last_pageblock_nr to the pageblock
      that was just checked.
      
      Additionally, move PageBuddy() check after pageblock unit check, since
      pageblock check is the first thing we should do and makes things more
      simple.
      
      [vbabka@suse.cz: rephrase commit description]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c122b208
    • Joonsoo Kim's avatar
      mm/compaction: change the timing to check to drop the spinlock · be1aa03b
      Joonsoo Kim authored
      
      
      It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th
      pfn page.  This may results in below situation while isolating
      migratepage.
      
      1. try isolate 0x0 ~ 0x200 pfn pages.
      2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
         the spinlock.
      3. Then, to complete isolating, retry to aquire the lock.
      
      I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the
      criteria about dropping the lock.  This has no harm 0x0 pfn, because, at
      this time, locked variable would be false.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be1aa03b
    • Joonsoo Kim's avatar
      mm/compaction: do not call suitable_migration_target() on every page · 01ead534
      Joonsoo Kim authored
      
      
      suitable_migration_target() checks that pageblock is suitable for
      migration target.  In isolate_freepages_block(), it is called on every
      page and this is inefficient.  So make it called once per pageblock.
      
      suitable_migration_target() also checks if page is highorder or not, but
      it's criteria for highorder is pageblock order.  So calling it once
      within pageblock range has no problem.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01ead534
    • Joonsoo Kim's avatar
      mm/compaction: disallow high-order page for migration target · 7d348b9e
      Joonsoo Kim authored
      
      
      Purpose of compaction is to get a high order page.  Currently, if we
      find high-order page while searching migration target page, we break it
      to order-0 pages and use them as migration target.  It is contrary to
      purpose of compaction, so disallow high-order page to be used for
      migration target.
      
      Additionally, clean-up logic in suitable_migration_target() to simplify
      the code.  There is no functional changes from this clean-up.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d348b9e
    • Michal Hocko's avatar
      mm: exclude memoryless nodes from zone_reclaim · 70ef57e6
      Michal Hocko authored
      
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70ef57e6
    • Davidlohr Bueso's avatar
      mm/memory.c: update comment in unmap_single_vma() · 7aa6b4ad
      Davidlohr Bueso authored
      
      
      The described issue now occurs inside mmap_region().  And unfortunately
      is still valid.
      
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7aa6b4ad
    • Weijie Yang's avatar
      mm/vmscan: do not check compaction_ready on promoted zones · 9bbc04ee
      Weijie Yang authored
      
      
      We abort direct reclaim if we find the zone is ready for compaction.
      Sometimes the zone is just a promoted highmem zone to force a scan of
      highmem, which is not the intended zone the caller want to allocate a
      page from.  In this situation, setting aborted_reclaim to indicate the
      caller turned back to retry the allocation is waste of time and could
      cause a loop in __alloc_pages_slowpath().
      
      This patch does not check compaction_ready() on promoted zones to avoid
      the above situation.  Only set aborted_reclaim if the caller intended
      zone is ready for compaction.
      
      Signed-off-by: default avatarWeijie Yang <weijie.yang@samsung.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9bbc04ee