Skip to content
  1. Jan 14, 2011
    • Andrea Arcangeli's avatar
      thp: update futex compound knowledge · a5b338f2
      Andrea Arcangeli authored
      
      
      Futex code is smarter than most other gup_fast O_DIRECT code and knows
      about the compound internals.  However now doing a put_page(head_page)
      will not release the pin on the tail page taken by gup-fast, leading to
      all sort of refcounting bugchecks.  Getting a stable head_page is a little
      tricky.
      
      page_head = page is there because if this is not a tail page it's also the
      page_head.  Only in case this is a tail page, compound_head is called,
      otherwise it's guaranteed unnecessary.  And if it's a tail page
      compound_head has to run atomically inside irq disabled section
      __get_user_pages_fast before returning.  Otherwise ->first_page won't be a
      stable pointer.
      
      Disableing irq before __get_user_page_fast and releasing irq after running
      compound_head is needed because if __get_user_page_fast returns == 1, it
      means the huge pmd is established and cannot go away from under us.
      pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
      wait for local_irq_enable before the IPI delivery can return.  This means
      __split_huge_page_refcount can't be running from under us, and in turn
      when we run compound_head(page) we're not reading a dangling pointer from
      tailpage->first_page.  Then after we get to stable head page, we are
      always safe to call compound_lock and after taking the compound lock on
      head page we can finally re-check if the page returned by gup-fast is
      still a tail page.  in which case we're set and we didn't need to split
      the hugepage in order to take a futex on it.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5b338f2
    • Andrea Arcangeli's avatar
      thp: put_page: recheck PageHead after releasing the compound_lock · a95a82e9
      Andrea Arcangeli authored
      
      
      After releasing the compound_lock split_huge_page can still run and release the
      page before put_page_testzero runs.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a95a82e9
    • Andrea Arcangeli's avatar
      thp: alter compound get_page/put_page · 91807063
      Andrea Arcangeli authored
      
      
      Alter compound get_page/put_page to keep references on subpages too, in
      order to allow __split_huge_page_refcount to split an hugepage even while
      subpages have been pinned by one of the get_user_pages() variants.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      91807063
    • Andrea Arcangeli's avatar
      thp: compound_lock · e9da73d6
      Andrea Arcangeli authored
      
      
      Add a new compound_lock() needed to serialize put_page against
      __split_huge_page_refcount().
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9da73d6
    • Andrea Arcangeli's avatar
      thp: mm: define MADV_HUGEPAGE · a826e422
      Andrea Arcangeli authored
      
      
      Define MADV_HUGEPAGE.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a826e422
    • Andrea Arcangeli's avatar
      thp: transparent hugepage support documentation · 1c9bf22c
      Andrea Arcangeli authored
      
      
      Documentation/vm/transhuge.txt
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c9bf22c
    • Andrea Arcangeli's avatar
      thp: fix bad_page to show the real reason the page is bad · 4e9f64c4
      Andrea Arcangeli authored
      
      
      page_count shows the count of the head page, but the actual check is done
      on the tail page, so show what is really being checked.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e9f64c4
    • Hugh Dickins's avatar
      thp: ksm: free swap when swapcache page is replaced · ae52a2ad
      Hugh Dickins authored
      
      
      When a swapcache page is replaced by a ksm page, it's best to free that
      swap immediately.
      
      Reported-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae52a2ad
    • Minchan Kim's avatar
      writeback: avoid unnecessary determine_dirtyable_memory call · 240c879f
      Minchan Kim authored
      
      
      I think determine_dirtyable_memory() is a rather costly function since it
      need many atomic reads for gathering zone/global page state.  But when we
      use vm_dirty_bytes && dirty_background_bytes, we don't need that costly
      calculation.
      
      This patch eliminates such unnecessary overhead.
      
      NOTE : newly added if condition might add overhead in normal path.
             But it should be _really_ small because anyway we need the
             access both vm_dirty_bytes and dirty_background_bytes so it is
             likely to hit the cache.
      
      [akpm@linux-foundation.org: fix used-uninitialised warning]
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      240c879f
    • Volodymyr G. Lukiianyk's avatar
      mm: set correct numa_zonelist_order string when configured on the kernel command line · ecb256f8
      Volodymyr G. Lukiianyk authored
      
      
      When numa_zonelist_order parameter is set to "node" or "zone" on the
      command line it's still showing as "default" in sysctl.  That's because
      early_param parsing function changes only user_zonelist_order variable.
      Fix this by copying user-provided string to numa_zonelist_order if it was
      successfully parsed.
      
      Signed-off-by: default avatarVolodymyr G Lukiianyk <volodymyrgl@gmail.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ecb256f8
    • Mel Gorman's avatar
      mm: kswapd: use the classzone idx that kswapd was using for sleeping_prematurely() · dc83edd9
      Mel Gorman authored
      
      
      When kswapd is woken up for a high-order allocation, it takes account of
      the highest usable zone by the caller (the classzone idx).  During
      allocation, this index is used to select the lowmem_reserve[] that should
      be applied to the watermark calculation in zone_watermark_ok().
      
      When balancing a node, kswapd considers the highest unbalanced zone to be
      the classzone index.  This will always be at least be the callers
      classzone_idx and can be higher.  However, sleeping_prematurely() always
      considers the lowest zone (e.g.  ZONE_DMA) to be the classzone index.
      This means that sleeping_prematurely() can consider a zone to be balanced
      that is unusable by the allocation request that originally woke kswapd.
      This patch changes sleeping_prematurely() to use a classzone_idx matching
      the value it used in balance_pgdat().
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc83edd9
    • Mel Gorman's avatar
      mm: kswapd: treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() · 355b09c4
      Mel Gorman authored
      
      
      After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
      be balanced but sleeping_prematurely does not.  This can force kswapd to
      stay awake longer than it should.  This patch fixes it.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      355b09c4
    • Mel Gorman's avatar
      mm: kswapd: reset kswapd_max_order and classzone_idx after reading · 4d40502e
      Mel Gorman authored
      
      
      When kswapd wakes up, it reads its order and classzone from pgdat and
      calls balance_pgdat.  While its awake, it potentially reclaimes at a high
      order and a low classzone index.  This might have been a once-off that was
      not required by subsequent callers.  However, because the pgdat values
      were not reset, they remain artifically high while balance_pgdat() is
      running and potentially kswapd enters a second unnecessary reclaim cycle.
      Reset the pgdat order and classzone index after reading.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d40502e
    • Mel Gorman's avatar
      mm: kswapd: use the order that kswapd was reclaiming at for sleeping_prematurely() · 0abdee2b
      Mel Gorman authored
      
      
      Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
      there was a race pushing a zone below its watermark.  If the race
      happened, it stays awake.  However, balance_pgdat() can decide to reclaim
      at order-0 if it decides that high-order reclaim is not working as
      expected.  This information is not passed back to sleeping_prematurely().
      The impact is that kswapd remains awake reclaiming pages long after it
      should have gone to sleep.  This patch passes the adjusted order to
      sleeping_prematurely and uses the same logic as balance_pgdat to decide if
      it's ok to go to sleep.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0abdee2b
    • Mel Gorman's avatar
      mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced · 1741c877
      Mel Gorman authored
      
      
      When reclaiming for high-orders, kswapd is responsible for balancing a
      node but it should not reclaim excessively.  It avoids excessive reclaim
      by considering if any zone in a node is balanced then the node is
      balanced.  In the cases where there are imbalanced zone sizes (e.g.
      ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep
      prematurely as just one small zone was balanced.
      
      This alters the sleep logic of kswapd slightly.  It counts the number of
      pages that make up the balanced zones.  If the total number of balanced
      pages is more than a quarter of the zone, kswapd will go back to sleep.
      This should keep a node balanced without reclaiming an excessive number of
      pages.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1741c877
    • Mel Gorman's avatar
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman authored
      
      
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevaling
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artifically awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
       Second, I previous specified slub_min_order=3 again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      Tgis patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99504748
    • Steven Rostedt's avatar
      mm: remove likely() from grab_cache_page_write_begin() · c585a267
      Steven Rostedt authored
      
      
      Running the annotated branch profiler on a box doing average work
      (firefox, evolution, xchat, distcc farm), the likely() used in
      grab_cache_page_write_begin() was incorrect most of the time:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
       1924262 71332401  97 grab_cache_page_write_begin    filemap.c           2206
      
      Adding a trace_printk() and running the function tracer limited to
      just this function I can see:
      
              gconfd-2-2696  [000]  4467.268935: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=7
              gconfd-2-2696  [000]  4467.268946: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268947: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=8
              gconfd-2-2696  [000]  4467.268959: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268960: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=9
              gconfd-2-2696  [000]  4467.268972: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268973: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=10
              gconfd-2-2696  [000]  4467.268991: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268992: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=11
              gconfd-2-2696  [000]  4467.269005: grab_cache_page_write_begin <-ext3_write_begin
      
      Which shows that a lot of calls from ext3_write_begin will result in the
      page returned by "find_lock_page" will be NULL.
      
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarNick Piggin <npiggin@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c585a267
    • Steven Rostedt's avatar
      mm: remove unlikely() from page_mapping() · e20e8779
      Steven Rostedt authored
      
      
      page_mapping() has a unlikely that the mapping has PAGE_MAPPING_ANON set.
      But running the annotated branch profiler on a normal desktop system doing
      vairous tasks (xchat, evolution, firefox, distcc), it is not really that
      unlikely that the mapping here will have the PAGE_MAPPING_ANON flag set:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
      35935762 1270265395  97 page_mapping                   mm.h                 659
      1306198001   143659   0 page_mapping                   mm.h                 657
      203131478   121586   0 page_mapping                   mm.h                 657
       5415491     1116   0 page_mapping                   mm.h                 657
      74899487     1116   0 page_mapping                   mm.h                 657
      203132845      224   0 page_mapping                   mm.h                 659
       5415464       27   0 page_mapping                   mm.h                 659
         13552        0   0 page_mapping                   mm.h                 657
         13552        0   0 page_mapping                   mm.h                 659
        242630        0   0 page_mapping                   mm.h                 657
        242630        0   0 page_mapping                   mm.h                 659
      74899487        0   0 page_mapping                   mm.h                 659
      
      The page_mapping() is a static inline, which is why it shows up multiple
      times.
      
      The unlikely in page_mapping() was correct a total of 1909540379 times and
      incorrect 1270533123 times, with a 39% being incorrect.  With this much of
      an error, it's best to simply remove the unlikely and have the compiler
      and branch prediction figure this out.
      
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e20e8779
    • Steven Rostedt's avatar
      mm: remove likely() from mapping_unevictable() · 088e5465
      Steven Rostedt authored
      
      
      The mapping_unevictable() has a likely() around the mapping parameter.
      This mapping parameter comes from page_mapping() which has an unlikely()
      that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
      NULL.  One would think that this unlikely() means that the mapping
      returned by page_mapping() would not be NULL, but where page_mapping() is
      used just above mapping_unevictable(), that unlikely() is incorrect most
      of the time.  This means that the "likely(mapping)" in
      mapping_unevictable() is incorrect most of the time.
      
      Running the annotated branch profiler on my main box which runs firefox,
      evolution, xchat and is part of my distcc farm, I had this:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
      12872836 1269443893  98 mapping_unevictable            pagemap.h            51
      35935762 1270265395  97 page_mapping                   mm.h                 659
      1306198001   143659   0 page_mapping                   mm.h                 657
      203131478   121586   0 page_mapping                   mm.h                 657
       5415491     1116   0 page_mapping                   mm.h                 657
      74899487     1116   0 page_mapping                   mm.h                 657
      203132845      224   0 page_mapping                   mm.h                 659
       5415464       27   0 page_mapping                   mm.h                 659
         13552        0   0 page_mapping                   mm.h                 657
         13552        0   0 page_mapping                   mm.h                 659
        242630        0   0 page_mapping                   mm.h                 657
        242630        0   0 page_mapping                   mm.h                 659
      74899487        0   0 page_mapping                   mm.h                 659
      
      The page_mapping() is a static inline, which is why it shows up multiple
      times.  The mapping_unevictable() is also a static inline but seems to be
      used only once in my setup.
      
      The unlikely in page_mapping() was correct a total of 1909540379 times and
      incorrect 1270533123 times, with a 39% being incorrect.  Perhaps this is
      enough to remove the unlikely from page_mapping() as well.
      
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarNick Piggin <npiggin@kernel.dk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      088e5465
    • Tobias Klauser's avatar
      vmalloc: remove redundant unlikely() · ddf9c6d4
      Tobias Klauser authored
      
      
      IS_ERR() already implies unlikely(), so it can be omitted here.
      
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ddf9c6d4
    • KOSAKI Motohiro's avatar
      mempolicy: remove tasklist_lock from migrate_pages · 1e50df39
      KOSAKI Motohiro authored
      
      
      Today, tasklist_lock in migrate_pages doesn't protect anything.
      rcu_read_lock() provide enough protection from pid hash walk.
      
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e50df39
    • Michel Lespinasse's avatar
      mlock: do not hold mmap_sem for extended periods of time · 53a7706d
      Michel Lespinasse authored
      
      
      __get_user_pages gets a new 'nonblocking' parameter to signal that the
      caller is prepared to re-acquire mmap_sem and retry the operation if
      needed.  This is used to split off long operations if they are going to
      block on a disk transfer, or when we detect contention on the mmap_sem.
      
      [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53a7706d
    • Michel Lespinasse's avatar
      mm: move VM_LOCKED check to __mlock_vma_pages_range() · 5fdb2002
      Michel Lespinasse authored
      
      
      Use a single code path for faulting in pages during mlock.
      
      The reason to have it in this patch series is that I did not want to
      update both code paths in a later change that releases mmap_sem when
      blocking on disk.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fdb2002
    • Michel Lespinasse's avatar
      mm: add FOLL_MLOCK follow_page flag. · 110d74a9
      Michel Lespinasse authored
      
      
      Move the code to mlock pages from __mlock_vma_pages_range() to
      follow_page().
      
      This allows __mlock_vma_pages_range() to not have to break down work into
      16-page batches.
      
      An additional motivation for doing this within the present patch series is
      that it'll make it easier for a later chagne to drop mmap_sem when
      blocking on disk (we'd like to be able to resume at the page that was read
      from disk instead of at the start of a 16-page batch).
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      110d74a9
    • Michel Lespinasse's avatar
      mlock: only hold mmap_sem in shared mode when faulting in pages · fed067da
      Michel Lespinasse authored
      
      
      Currently mlock() holds mmap_sem in exclusive mode while the pages get
      faulted in.  In the case of a large mlock, this can potentially take a
      very long time, during which various commands such as 'ps auxw' will
      block.  This makes sysadmins unhappy:
      
      real    14m36.232s
      user    0m0.003s
      sys     0m0.015s
      (output from 'time ps auxw' while a 20GB file was being mlocked without
      being previously preloaded into page cache)
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages, in small batches, releasing mmap_sem when we
      block on disk access or when we detect some contention.
      
      This patch:
      
      Before this change, mlock() holds mmap_sem in exclusive mode while the
      pages get faulted in.  In the case of a large mlock, this can potentially
      take a very long time.  Various things will block while mmap_sem is held,
      including 'ps auxw'.  This can make sysadmins angry.
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages with mmap_sem held for reads only.  We need to
      recheck the vma flags after we re-acquire mmap_sem, but this is easy.
      
      In the case where a vma has been munlocked before mlock completes, pages
      that were already marked as PageMlocked() are handled by the munlock()
      call, and mlock() is careful to not mark new page batches as PageMlocked()
      after the munlock() call has cleared the VM_LOCKED vma flags.  So, the end
      result will be identical to what'd happen if munlock() had executed after
      the mlock() call.
      
      In a later change, I will allow the second pass to release mmap_sem when
      blocking on disk accesses or when it is otherwise contended, so that it
      won't be held for long periods of time even in shared mode.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Tested-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fed067da
    • Michel Lespinasse's avatar
      mlock: avoid dirtying pages and triggering writeback · 5ecfda04
      Michel Lespinasse authored
      
      
      When faulting in pages for mlock(), we want to break COW for anonymous or
      file pages within VM_WRITABLE, non-VM_SHARED vmas.  However, there is no
      need to write-fault into VM_SHARED vmas since shared file pages can be
      mlocked first and dirtied later, when/if they actually get written to.
      Skipping the write fault is desirable, as we don't want to unnecessarily
      cause these pages to be dirtied and queued for writeback.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ecfda04
    • Michel Lespinasse's avatar
      do_wp_page: clarify dirty_page handling · 72ddc8f7
      Michel Lespinasse authored
      
      
      Reorganize the code so that dirty pages are handled closer to the place
      that makes them dirty (handling write fault into shared, writable VMAs).
      No behavior changes.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72ddc8f7
    • Michel Lespinasse's avatar
      do_wp_page: remove the 'reuse' flag · b009c024
      Michel Lespinasse authored
      
      
      mlocking a shared, writable vma currently causes the corresponding pages
      to be marked as dirty and queued for writeback.  This seems rather
      unnecessary given that the pages are not being actually modified during
      mlock.  It is understood that for non-shared mappings (file or anon) we
      want to use a write fault in order to break COW, but there is just no such
      need for shared mappings.
      
      The first two patches in this series do not introduce any behavior change.
       The intent there is to make it obvious that dirtying file pages is only
      done in the (writable, shared) case.  I think this clarifies the code, but
      I wouldn't mind dropping these two patches if there is no consensus about
      them.
      
      The last patch is where we actually avoid dirtying shared mappings during
      mlock.  Note that as a side effect of this, we won't call page_mkwrite()
      for the mappings that define it, and won't be pre-allocating data blocks
      at the FS level if the mapped file was sparsely allocated.  My
      understanding is that mlock does not need to provide such guarantee, as
      evidenced by the fact that it never did for the filesystems that don't
      define page_mkwrite() - including some common ones like ext3.  However, I
      would like to gather feedback on this from filesystem people as a
      precaution.  If this turns out to be a showstopper, maybe block
      preallocation can be added back on using a different interface.
      
      Large shared mlocks are getting significantly (>2x) faster in my tests, as
      the disk can be fully used for reading the file instead of having to share
      between this and writeback.
      
      This patch:
      
      Reorganize the code to remove the 'reuse' flag.  No behavior changes.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b009c024
    • Rik van Riel's avatar
      mm: clear PageError bit in msync & fsync · 212260aa
      Rik van Riel authored
      
      
      Temporary IO failures, eg.  due to loss of both multipath paths, can
      permanently leave the PageError bit set on a page, resulting in msync or
      fsync returning -EIO over and over again, even if IO is now getting to the
      disk correctly.
      
      We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
      filemap_fdatawait_range function.  Also clearing the PageError bit on the
      page allows subsequent msync or fsync calls on this file to return without
      an error, if the subsequent IO succeeds.
      
      Unfortunately data written out in the msync or fsync call that returned
      -EIO can still get lost, because the page dirty bit appears to not get
      restored on IO error.  However, the alternative could be potentially all
      of memory filling up with uncleanable dirty pages, hanging the system, so
      there is no nice choice here...
      
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarValerie Aurora <vaurora@redhat.com>
      Acked-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      212260aa
    • Mandeep Singh Baines's avatar
      oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down · dabb16f6
      Mandeep Singh Baines authored
      
      
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternative considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't wan't all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch updated from original patch based on feedback from David
      Rientjes.
      
      Signed-off-by: default avatarMandeep Singh Baines <msb@chromium.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dabb16f6
    • David Rientjes's avatar
      mm: unify module_alloc code for vmalloc · d0a21265
      David Rientjes authored
      
      
      Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
      module_init().  Much of the code is duplicated and can be generalized in a
      globally accessible function, __vmalloc_node_range().
      
      __vmalloc_node() now calls into __vmalloc_node_range() with a range of
      [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.
      
      Each architecture may then use __vmalloc_node_range() directly to remove
      the duplication of code.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0a21265
    • David Rientjes's avatar
      mm: remove gfp mask from pcpu_get_vm_areas · ec3f64fc
      David Rientjes authored
      
      
      pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
      formal and use the mask internally.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec3f64fc
    • David Rientjes's avatar
      mm: remove unused get_vm_area_node · e5a5623b
      David Rientjes authored
      
      
      get_vm_area_node() is unused in the kernel and can thus be removed.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5a5623b
    • Mel Gorman's avatar
      mm: vmscan: rename lumpy_mode to reclaim_mode · f3a310bc
      Mel Gorman authored
      
      
      With compaction being used instead of lumpy reclaim, the name lumpy_mode
      and associated variables is a bit misleading.  Rename lumpy_mode to
      reclaim_mode which is a better fit.  There is no functional change.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3a310bc
    • Mel Gorman's avatar
      mm: compaction: perform a faster migration scan when migrating asynchronously · 9927af74
      Mel Gorman authored
      
      
      try_to_compact_pages() is initially called to only migrate pages
      asychronously and kswapd always compacts asynchronously.  Both are being
      optimistic so it is important to complete the work as quickly as possible
      to minimise stalls.
      
      This patch alters the scanner when asynchronous to only consider
      MIGRATE_MOVABLE pageblocks as migration candidates.  This reduces stalls
      when allocating huge pages while not impairing allocation success rates as
      a full scan will be performed if necessary after direct reclaim.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9927af74
    • Mel Gorman's avatar
      mm: migration: cleanup migrate_pages API by matching types for offlining and sync · 7f0f2496
      Mel Gorman authored
      
      
      With the introduction of the boolean sync parameter, the API looks a
      little inconsistent as offlining is still an int.  Convert offlining to a
      bool for the sake of being tidy.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f0f2496
    • Mel Gorman's avatar
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman authored
      
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path
      
      Migration synchronously waits for writeback if the initial passes fails.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
    • Mel Gorman's avatar
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman authored
      
      
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim whe compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • Mel Gorman's avatar
      mm: vmscan: convert lumpy_mode into a bitmask · ee64fc93
      Mel Gorman authored
      
      
      Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
      syncronous or asyncronous.  In preparation for using compaction instead of
      lumpy reclaim, this patch converts the flags into a bitmap.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee64fc93
    • Mel Gorman's avatar
      mm: compaction: add trace events for memory compaction activity · b7aba698
      Mel Gorman authored
      
      
      In preparation for a patches promoting the use of memory compaction over
      lumpy reclaim, this patch adds trace points for memory compaction
      activity.  Using them, we can monitor the scanning activity of the
      migration and free page scanners as well as the number and success rates
      of pages passed to page migration.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7aba698