Skip to content
  1. Jul 29, 2016
    • Mel Gorman's avatar
      mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit · 84c7a777
      Mel Gorman authored
      
      
      The buffer_heads_over_limit limit in kswapd is inconsistent with direct
      reclaim behaviour.  It may force an an attempt to reclaim from all zones
      and then not reclaim at all because higher zones were balanced than
      required by the original request.
      
      This patch will causes kswapd to consider reclaiming from all zones if
      buffer_heads_over_limit.  However, if there are eligible zones for the
      allocation request that woke kswapd then no reclaim will occur even if
      buffer_heads_over_limit.  This avoids kswapd over-reclaiming just
      because buffer_heads_over_limit.
      
      [mgorman@techsingularity.net: fix comment about buffer_heads_over_limit]
        Link: http://lkml.kernel.org/r/1468404004-5085-2-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-28-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84c7a777
    • Mel Gorman's avatar
      mm, vmscan: avoid passing in `remaining' unnecessarily to prepare_kswapd_sleep() · d9f21d42
      Mel Gorman authored
      
      
      As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep()
      always passes in 0 for `remaining' and the second call can trivially
      check the parameter in advance.
      
      Suggested-by: default avatarMinchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1467970510-21195-27-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9f21d42
    • Mel Gorman's avatar
      mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready · 4f588331
      Mel Gorman authored
      
      
      The scan_control structure has enough information available for
      compaction_ready() to make a decision.  The classzone_idx manipulations
      in shrink_zones() are no longer necessary as the highest populated zone
      is no longer used to determine if shrink_slab should be called or not.
      
      [mgorman@techsingularity.net remove redundant check in shrink_zones()]
        Link: http://lkml.kernel.org/r/1468588165-12461-3-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-26-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f588331
    • Mel Gorman's avatar
      mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node · 970a39a3
      Mel Gorman authored
      
      
      shrink_node receives all information it needs about classzone_idx from
      sc->reclaim_idx so remove the aliases.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-25-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      970a39a3
    • Mel Gorman's avatar
      mm: convert zone_reclaim to node_reclaim · a5f5f91d
      Mel Gorman authored
      
      
      As reclaim is now per-node based, convert zone_reclaim to be
      node_reclaim.  It is possible that a node will be reclaimed multiple
      times if it has multiple zones but this is unavoidable without caching
      all nodes traversed so far.  The documentation and interface to
      userspace is the same from a configuration perspective and will will be
      similar in behaviour unless the node-local allocation requests were also
      limited to lower zones.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5f5f91d
    • Mel Gorman's avatar
      mm, page_alloc: wake kswapd based on the highest eligible zone · 52e9f87a
      Mel Gorman authored
      
      
      The ac_classzone_idx is used as the basis for waking kswapd and that is
      based on the preferred zoneref.  If the preferred zoneref's first zone
      is lower than what is available on other nodes, it's possible that
      kswapd is woken on a zone with only higher, but still eligible, zones.
      As classzone_idx is strictly adhered to now, it causes a problem because
      eligible pages are skipped.
      
      For example, node 0 has only DMA32 and node 1 has only NORMAL.  An
      allocating context running on node 0 may wake kswapd on node 1 telling
      it to skip all NORMAL pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52e9f87a
    • Mel Gorman's avatar
      mm, vmscan: only wakeup kswapd once per node for the requested classzone · e1a55637
      Mel Gorman authored
      
      
      kswapd is woken when zones are below the low watermark but the wakeup
      decision is not taking the classzone into account.  Now that reclaim is
      node-based, it is only required to wake kswapd once per node and only if
      all zones are unbalanced for the requested classzone.
      
      Note that one node might be checked multiple times if the zonelist is
      ordered by node because there is no cheap way of tracking what nodes
      have already been visited.  For zone-ordering, each node should be
      checked only once.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1a55637
    • Mel Gorman's avatar
      mm: move vmscan writes and file write accounting to the node · c4a25635
      Mel Gorman authored
      
      
      As reclaim is now node-based, it follows that page write activity due to
      page reclaim should also be accounted for on the node.  For consistency,
      also account page writes and page dirtying on a per-node basis.
      
      After this patch, there are a few remaining zone counters that may appear
      strange but are fine.  NUMA stats are still per-zone as this is a
      user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
      NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
      potentially pin low memory and cannot trivially be reclaimed on demand.
      This information is still useful for debugging a page allocation failure
      warning.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4a25635
    • Mel Gorman's avatar
      mm: move most file-based accounting to the node · 11fb9989
      Mel Gorman authored
      
      
      There are now a number of accounting oddities such as mapped file pages
      being accounted for on the node while the total number of file pages are
      accounted on the zone.  This can be coped with to some extent but it's
      confusing so this patch moves the relevant file-based accounted.  Due to
      throttling logic in the page allocator for reliable OOM detection, it is
      still necessary to track dirty and writeback pages on a per-zone basis.
      
      [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
        Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minch...
      11fb9989
    • Mel Gorman's avatar
      mm: rename NR_ANON_PAGES to NR_ANON_MAPPED · 4b9d0fab
      Mel Gorman authored
      
      
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_PAGES  is the number of mapped anon pages.
      
      This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
      NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
      have
      
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_MAPPED is the number of mapped anon pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b9d0fab
    • Mel Gorman's avatar
      mm: move page mapped accounting to the node · 50658e2e
      Mel Gorman authored
      
      
      Reclaim makes decisions based on the number of pages that are mapped but
      it's mixing node and zone information.  Account NR_FILE_MAPPED and
      NR_ANON_PAGES pages on the node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50658e2e
    • Mel Gorman's avatar
      mm, page_alloc: consider dirtyable memory in terms of nodes · 281e3726
      Mel Gorman authored
      
      
      Historically dirty pages were spread among zones but now that LRUs are
      per-node it is more appropriate to consider dirty pages in a node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      281e3726
    • Mel Gorman's avatar
      mm, workingset: make working set detection node-aware · 1e6b1085
      Mel Gorman authored
      
      
      Working set and refault detection is still zone-based, fix it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e6b1085
    • Mel Gorman's avatar
      mm, memcg: move memcg limit enforcement from zones to nodes · ef8f2327
      Mel Gorman authored
      
      
      Memcg needs adjustment after moving LRUs to the node.  Limits are
      tracked per memcg but the soft-limit excess is tracked per zone.  As
      global page reclaim is based on the node, it is easy to imagine a
      situation where a zone soft limit is exceeded even though the memcg
      limit is fine.
      
      This patch moves the soft limit tree the node.  Technically, all the
      variable names should also change but people are already familiar by the
      meaning of "mz" even if "mn" would be a more appropriate name now.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef8f2327
    • Mel Gorman's avatar
      mm, vmscan: make shrink_node decisions more node-centric · a9dd0a83
      Mel Gorman authored
      
      
      Earlier patches focused on having direct reclaim and kswapd use data
      that is node-centric for reclaiming but shrink_node() itself still uses
      too much zone information.  This patch removes unnecessary zone-based
      information with the most important decision being whether to continue
      reclaim or not.  Some memcg APIs are adjusted as a result even though
      memcg itself still uses some zone information.
      
      [mgorman@techsingularity.net: optimization]
        Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9dd0a83
    • Mel Gorman's avatar
      mm: vmscan: do not reclaim from kswapd if there is any eligible zone · 86c79f6b
      Mel Gorman authored
      
      
      kswapd scans from highest to lowest for a zone that requires balancing.
      This was necessary when reclaim was per-zone to fairly age pages on
      lower zones.  Now that we are reclaiming on a per-node basis, any
      eligible zone can be used and pages will still be aged fairly.  This
      patch avoids reclaiming excessively unless buffer_heads are over the
      limit and it's necessary to reclaim from a higher zone than requested by
      the waker of kswapd to relieve low memory pressure.
      
      [hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
      Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-13-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86c79f6b
    • Mel Gorman's avatar
      mm, vmscan: remove duplicate logic clearing node congestion and dirty state · 6256c6b4
      Mel Gorman authored
      
      
      Reclaim may stall if there is too much dirty or congested data on a
      node.  This was previously based on zone flags and the logic for
      clearing the flags is in two places.  As congestion/dirty tracking is
      now tracked on a per-node basis, we can remove some duplicate logic.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-12-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6256c6b4
    • Mel Gorman's avatar
      mm, vmscan: by default have direct reclaim only shrink once per node · 79dafcdc
      Mel Gorman authored
      
      
      Direct reclaim iterates over all zones in the zonelist and shrinking
      them but this is in conflict with node-based reclaim.  In the default
      case, only shrink once per node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-11-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79dafcdc
    • Mel Gorman's avatar
      mm, vmscan: simplify the logic deciding whether kswapd sleeps · 38087d9b
      Mel Gorman authored
      
      
      kswapd goes through some complex steps trying to figure out if it should
      stay awake based on the classzone_idx and the requested order.  It is
      unnecessarily complex and passes in an invalid classzone_idx to
      balance_pgdat().  What matters most of all is whether a larger order has
      been requsted and whether kswapd successfully reclaimed at the previous
      order.  This patch irons out the logic to check just that and the end
      result is less headache inducing.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38087d9b
    • Mel Gorman's avatar
      mm, vmscan: remove balance gap · 31483b6a
      Mel Gorman authored
      
      
      The balance gap was introduced to apply equal pressure to all zones when
      reclaiming for a higher zone.  With node-based LRU, the need for the
      balance gap is removed and the code is dead so remove it.
      
      [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
      Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31483b6a
    • Mel Gorman's avatar
      mm, vmscan: make kswapd reclaim in terms of nodes · 1d82de61
      Mel Gorman authored
      
      
      Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
      thinking of reclaim in terms of nodes but kswapd is still zone-centric.
      This patch gets rid of many of the node-based versus zone-based
      decisions.
      
      o A node is considered balanced when any eligible lower zone is balanced.
        This eliminates one class of age-inversion problem because we avoid
        reclaiming a newer page just because it's in the wrong zone
      o pgdat_balanced disappears because we now only care about one zone being
        balanced.
      o Some anomalies related to writeback and congestion tracking being based on
        zones disappear.
      o kswapd no longer has to take care to reclaim zones in the reverse order
        that the page allocator uses.
      o Most importantly of all, reclaim from node 0 with multiple zones will
        have similar aging and reclaiming characteristics as every
        other node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-8-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d82de61
    • Mel Gorman's avatar
      mm, vmscan: have kswapd only scan based on the highest requested zone · f7b60926
      Mel Gorman authored
      
      
      kswapd checks all eligible zones to see if they need balancing even if
      it was woken for a lower zone.  This made sense when we reclaimed on a
      per-zone basis because we wanted to shrink zones fairly so avoid
      age-inversion problems.  Ideally this is completely unnecessary when
      reclaiming on a per-node basis.  In theory, there may still be anomalies
      when all requests are for lower zones and very old pages are preserved
      in higher zones but this should be the exceptional case.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-7-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7b60926
    • Mel Gorman's avatar
      mm, vmscan: begin reclaiming pages on a per-node basis · b2e18757
      Mel Gorman authored
      
      
      This patch makes reclaim decisions on a per-node basis.  A reclaimer
      knows what zone is required by the allocation request and skips pages
      from higher zones.  In many cases this will be ok because it's a
      GFP_HIGHMEM request of some description.  On 64-bit, ZONE_DMA32 requests
      will cause some problems but 32-bit devices on 64-bit platforms are
      increasingly rare.  Historically it would have been a major problem on
      32-bit with big Highmem:Lowmem ratios but such configurations are also
      now rare and even where they exist, they are not encouraged.  If it
      really becomes a problem, it'll manifest as very low reclaim
      efficiencies.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-6-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2e18757
    • Mel Gorman's avatar
      mm, mmzone: clarify the usage of zone padding · 0f661148
      Mel Gorman authored
      
      
      Zone padding separates write-intensive fields used by page allocation,
      compaction and vmstats but the comments are a little misleading and need
      clarification.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f661148
    • Mel Gorman's avatar
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman authored
      
      
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
    • Mel Gorman's avatar
      mm, vmscan: move lru_lock to the node · a52633d8
      Mel Gorman authored
      
      
      Node-based reclaim requires node-based LRUs and locking.  This is a
      preparation patch that just moves the lru_lock to the node so later
      patches are easier to review.  It is a mechanical change but note this
      patch makes contention worse because the LRU lock is hotter and direct
      reclaim and kswapd can contend on the same lock even when reclaiming
      from different zones.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a52633d8
    • Mel Gorman's avatar
      mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman authored
      
      
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly different to node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in distribution of file/anon pages due
         to when they were allocated results can result in a difference in
         again. While the fair zone allocation policy mitigates some of the
         problems here, the page reclaim results on a multi-zone node will
         always be different to a single-zone node.
         it was scheduled on as a result.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other but it's sensitive to timing.  This
         mitigates the page allocator using pages that were allocated very recently
         in the ideal case but it's sensitive to timing. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have relatively low highmem:lowmem
      ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artifically benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or not impact on tiobench which is
      desirable and a reduction in system CPU usage. It indicates that the fair
      zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in numbers of inodes reclaimed but the workload has few active inodes and
      is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with even though I could not reproduce any problems in
      that area.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targetted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THP's worth page pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different but not
         necessarily in any way that matters but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75ef7184
    • Mel Gorman's avatar
      mm, meminit: remove early_page_nid_uninitialised · a621184a
      Mel Gorman authored
      The helper early_page_nid_uninitialised() has been dead since commit
      974a786e
      
       ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
      dead code.
      
      Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a621184a
    • Michal Hocko's avatar
      cpuset, mm: fix TIF_MEMDIE check in cpuset_change_task_nodemask · fec1e5f9
      Michal Hocko authored
      Commit c0ff7453
      
       ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
      it is checking the flag on the current task rather than the given one.
      
      This doesn't make much sense and it is actually wrong.  If the current
      task which updates the nodemask of a cpuset got killed by the OOM killer
      then a part of the cpuset cgroup processes would have incompatible
      nodemask which is surprising to say the least.
      
      The comment suggests the intention was to skip oom victim or an exiting
      task so we should be checking the given task.  But even then it would be
      layering violation because it is the memory allocator to interpret the
      TIF_MEMDIE meaning.  Simply drop both checks.  All tasks in the cpuset
      should simply follow the same mask.
      
      Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fec1e5f9
    • Michal Hocko's avatar
      freezer, oom: check TIF_MEMDIE on the correct task · a34c80a7
      Michal Hocko authored
      
      
      freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
      It is, however, checking the flag on the current task rather than the
      given one.  This is really confusing because freezing() can be called
      also on !current tasks.  It would end up working correctly for its main
      purpose because __refrigerator will be always called on the current task
      so the oom victim will never get frozen.  But it could lead to
      surprising results when a task which is freezing a cgroup got oom killed
      because only part of the cgroup would get frozen.  This is highly
      unlikely but worth fixing as the resulting code would be more clear
      anyway.
      
      Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a34c80a7
    • Ganesh Mahendran's avatar
      mm/compaction: remove unnecessary order check in try_to_compact_pages() · b2b331f9
      Ganesh Mahendran authored
      
      
      The caller __alloc_pages_direct_compact() already checked (order == 0)
      so there's no need to check again.
      
      Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2b331f9
    • Johannes Weiner's avatar
      mm: fix vm-scalability regression in cgroup-aware workingset code · 55779ec7
      Johannes Weiner authored
      Commit 23047a96 ("mm: workingset: per-cgroup cache thrash
      detection") added a page->mem_cgroup lookup to the cache eviction,
      refault, and activation paths, as well as locking to the activation
      path, and the vm-scalability tests showed a regression of -23%.
      
      While the test in question is an artificial worst-case scenario that
      doesn't occur in real workloads - reading two sparse files in parallel
      at full CPU speed just to hammer the LRU paths - there is still some
      optimizations that can be done in those paths.
      
      Inline the lookup functions to eliminate calls.  Also, page->mem_cgroup
      doesn't need to be stabilized when counting an activation; we merely
      need to hold the RCU lock to prevent the memcg from being freed.
      
      This cuts down on overhead quite a bit:
      
      23047a96
      
       063f6715e77a7be5770d6081fe
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
        21621405 +- 0%     +11.3%   24069657 +- 2%  vm-scalability.throughput
      
      [linux@roeck-us.net: drop unnecessary include file]
      [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
        Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
      Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
      Reported-by: default avatarYe Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55779ec7
    • zhong jiang's avatar
      mm: update the comment in __isolate_free_page · 400bc7fd
      zhong jiang authored
      
      
      We need to assure the comment is consistent with the code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      400bc7fd
    • Michal Hocko's avatar
      mm, oom: tighten task_will_free_mem() locking · 091f362c
      Michal Hocko authored
      
      
      "mm, oom: fortify task_will_free_mem" has dropped task_lock around
      task_will_free_mem in oom_kill_process bacause it assumed that a
      potential race when the selected task exits will not be a problem as the
      oom_reaper will call exit_oom_victim.
      
      Tetsuo was objecting that nommu doesn't have oom_reaper so the race
      would be still possible.  The code would be racy and lockup prone
      theoretically in other aspects without the oom reaper anyway so I didn't
      considered this a big deal.  But it seems that further changes I am
      planning in this area will benefit from stable task->mm in this path as
      well.  So let's drop find_lock_task_mm from task_will_free_mem and call
      it from under task_lock as we did previously.  Just pull the task->mm !=
      NULL check inside the function.
      
      Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      091f362c
    • Michal Hocko's avatar
      mm, oom: hide mm which is shared with kthread or global init · a373966d
      Michal Hocko authored
      
      
      The only case where the oom_reaper is not triggered for the oom victim
      is when it shares the memory with a kernel thread (aka use_mm) or with
      the global init.  After "mm, oom: skip vforked tasks from being
      selected" the victim cannot be a vforked task of the global init so we
      are left with clone(CLONE_VM) (without CLONE_SIGHAND).  use_mm() users
      are quite rare as well.
      
      In order to help forward progress for the OOM killer, make sure that
      this really rare case will not get in the way - we do this by hiding the
      mm from the oom killer by setting MMF_OOM_REAPED flag for it.
      oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
      MMF_OOM_REAPED flag set to catch these oom victims.
      
      After this patch we should guarantee forward progress for the OOM killer
      even when the selected victim is sharing memory with a kernel thread or
      global init as long as the victims mm is still alive.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a373966d
    • Michal Hocko's avatar
      mm, oom_reaper: do not attempt to reap a task more than twice · 11a410d5
      Michal Hocko authored
      
      
      oom_reaper relies on the mmap_sem for read to do its job.  Many places
      which might block readers have been converted to use down_write_killable
      and that has reduced chances of the contention a lot.  Some paths where
      the mmap_sem is held for write can take other locks and they might
      either be not prepared to fail due to fatal signal pending or too
      impractical to be changed.
      
      This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
      first attempt to reap a task's mm fails.  If the flag is present after
      the failure then we set MMF_OOM_REAPED to hide this mm from the oom
      killer completely so it can go and chose another victim.
      
      As a result a risk of OOM deadlock when the oom victim would be blocked
      indefinetly and so the oom killer cannot make any progress should be
      mitigated considerably while we still try really hard to perform all
      reclaim attempts and stay predictable in the behavior.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11a410d5
    • Michal Hocko's avatar
      mm, oom: task_will_free_mem should skip oom_reaped tasks · 696453e6
      Michal Hocko authored
      
      
      The 0-day robot has encountered the following:
      
         Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
         Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
      
      oom_reaper is trying to reap the same task again and again.
      
      This is possible only when the oom killer is bypassed because of
      task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
      already set during select_bad_process.  Teach task_will_free_mem to skip
      over MMF_OOM_REAPED tasks as well because they will be unlikely to free
      anything more.
      
      Analyzed by Tetsuo Handa.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      696453e6
    • Michal Hocko's avatar
      mm, oom: fortify task_will_free_mem() · 1af8bb43
      Michal Hocko authored
      task_will_free_mem is rather weak.  It doesn't really tell whether the
      task has chance to drop its mm.  98748bd7
      
       ("oom: consider
      multi-threaded tasks in task_will_free_mem") made a first step into making
      it more robust for multi-threaded applications so now we know that the
      whole process is going down and probably drop the mm.
      
      This patch builds on top for more complex scenarios where mm is shared
      between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
      use_mm().
      
      Make sure that all processes sharing the mm are killed or exiting.  This
      will allow us to replace try_oom_reaper by wake_oom_reaper because
      task_will_free_mem implies the task is reapable now.  Therefore all paths
      which bypass the oom killer are now reapable and so they shouldn't lock up
      the oom killer.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1af8bb43
    • Michal Hocko's avatar
      mm, oom: kill all tasks sharing the mm · 97fd49c2
      Michal Hocko authored
      
      
      Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
      process sharing the same mm is unkillable via OOM_ADJUST_MIN.  After "mm,
      oom_adj: make sure processes sharing mm have same view of oom_score_adj"
      all such processes are sharing the same value so we shouldn't see such a
      task at all (oom_badness would rule them out).
      
      We can still encounter oom disabled vforked task which has to be killed as
      well if we want to have other tasks sharing the mm reapable because it can
      access the memory before doing exec.  Killing such a task should be
      acceptable because it is highly unlikely it has done anything useful
      because it cannot modify any memory before it calls exec.  An alternative
      would be to keep the task alive and skip the oom reaper and risk all the
      weird corner cases where the OOM killer cannot make forward progress
      because the oom victim hung somewhere on the way to exit.
      
      [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
       the setting is inherently racy and we cannot do much about it without
       introducing locks in hot paths]
      Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97fd49c2
    • Michal Hocko's avatar
      mm, oom: skip vforked tasks from being selected · b18dc5f2
      Michal Hocko authored
      
      
      vforked tasks are not really sitting on any memory.  They are sharing the
      mm with parent until they exec into a new code.  Until then it is just
      pinning the address space.  OOM killer will kill the vforked task along
      with its parent but we still can end up selecting vforked task when the
      parent wouldn't be selected.  E.g.  init doing vfork to launch a task or
      vforked being a child of oom unkillable task with an updated oom_score_adj
      to be killable.
      
      Add a new helper to check whether a task is in the vfork sharing memory
      with its parent and use it in oom_badness to skip over these tasks.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b18dc5f2